We present a runtime system for simple and efficient programming of CPU+GPU clusters. The programmer focuses on core logic, while the system undertakes task allocation, load balancing, scheduling, data transfer, etc. Our programming model is based on a shared global address space, made efficient by transaction-style bulk-synchronous semantics. This model broadly targets coarse-grained data-parallel computation particularly suited to multi-GPU heterogeneous clusters. We describe our computation and communication scheduling system and report its performance on a few prototype applications. For example, parallelizing matrix multiplication or 2D FFT with our system requires only the regular CPU/GPU implementations plus about 30 lines of additional C code to set up the runtime. Our runtime system achieves a performance of 5.61 TFlop/s while multiplying two square matrices of 1.56 billion elements each over a 10-node cluster with 20 GPUs. This performance is possible due to a number of critical optimizations working in concert, including prefetching, pipelining, maximizing overlap between computation and communication, and scheduling efficiently across heterogeneous devices of vastly different capacities.
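The overlap optimizations cited above can be illustrated with a toy timing model (all costs below are hypothetical; the actual runtime overlaps CUDA copies and kernels): with double buffering, each tile's transfer hides behind the previous tile's computation, so total time approaches the larger of the two streams rather than their sum.

```python
# Toy timing model of double-buffered pipelining (hypothetical costs):
# while tile k is being multiplied, tile k+1 is already being transferred.

def serial_time(tiles, transfer, compute):
    """No overlap: each tile is copied, then computed."""
    return sum(transfer(t) + compute(t) for t in tiles)

def pipelined_time(tiles, transfer, compute):
    """Transfers (one DMA engine) overlap with the previous tile's compute."""
    xfer_done = 0.0   # when the DMA engine finishes the current tile
    comp_done = 0.0   # when the compute engine finishes the current tile
    for t in tiles:
        xfer_done += transfer(t)
        comp_done = max(comp_done, xfer_done) + compute(t)
    return comp_done

if __name__ == "__main__":
    tiles = list(range(8))
    print(serial_time(tiles, lambda t: 2.0, lambda t: 3.0))     # 40.0
    print(pipelined_time(tiles, lambda t: 2.0, lambda t: 3.0))  # 26.0
```

With eight tiles costing 2 units to transfer and 3 to compute, pipelining cuts 40 units to 26: one initial transfer followed by back-to-back computes.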
An integer linear programming (ILP) model is proposed for mapping applications onto hardware platforms consisting of microprocessors (μPs), ASICs, and FPGAs. The model solves the assignment and scheduling problems while taking into account the time required to reconfigure the FPGAs. Specifically, the type of the tasks executed on the FPGAs is considered, so that tasks performing the same function are scheduled consecutively; in this way the number of FPGA reconfigurations is reduced and performance is improved. Also, FPGA reconfiguration is hidden because it happens, where possible, in the intervals when the FPGAs are idle. Thus, the execution time of tasks mapped on the FPGAs does not increase due to reconfiguration time, and performance is improved. Moreover, the memory requirements for storing the program and configuration code of the tasks executed on μPs and FPGAs are taken into consideration. In addition, resource conflicts caused by tasks or data transfers that overlap in time and are assigned to the same resource (PE/bus) are addressed. Finally, resource sharing is supported. These features are validated by a series of experiments, including a real-application example, the M-JPEG encoder. The complexity of the model is also studied in terms of the number of generated constraints and variables and the time required to solve it.
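The consecutive-scheduling idea can be sketched outside the ILP; the snippet below is a deliberate simplification that ignores the deadlines, dependencies, and bus conflicts the full model handles, and just counts how grouping same-type tasks cuts reconfigurations.

```python
def reconfigurations(schedule):
    """Count reconfigurations: each change of task type reprograms the FPGA."""
    count, current = 0, None
    for task_type in schedule:
        if task_type != current:
            count, current = count + 1, task_type
    return count

def group_by_type(schedule):
    """Reorder so tasks of the same type run consecutively (stable order)."""
    first_seen = []
    for t in schedule:
        if t not in first_seen:
            first_seen.append(t)
    return sorted(schedule, key=first_seen.index)

if __name__ == "__main__":
    schedule = ["fft", "dct", "fft", "dct", "fft"]
    print(reconfigurations(schedule))                 # 5
    print(reconfigurations(group_by_type(schedule)))  # 2
```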
The best mapping of a task to one or more processing units in a heterogeneous system depends on multiple variables. Several approaches based on runtime systems have been proposed that determine the best mapping under given circumstances automatically. Some of them also consider dynamic events like varying problem sizes or resource competition that may change the best mapping during application runtime but only a few even consider that task execution may fail. While aging or overheating are well-known causes for sudden faults, the ongoing miniaturization and the growing complexity of heterogeneous computing are expected to create further threats for successful application execution. However, if properly incorporated, heterogeneous systems also offer the opportunity to recover from different types of faults in hardware as well as in software. In this work, we propose a combination of both topics, dynamic performance-oriented task mapping and dependability, to leverage this opportunity. As we will show, this combination not only enables tolerating faults in hardware and software with minor assistance of the developer, it also provides benefits for application development itself and for application performance in case of faults due to a new metric and automatic data management.
This paper presents a compositional performance analysis technique, enabling predictable deployment of software components on heterogeneous multiprocessor architectures. This analysis technique introduces (a) composable software and hardware component models representing abstract specifications of component behaviour and the corresponding resources, (b) operational semantics enabling composition of the models into an executable system model, and (c) simulation-based analysis of the obtained executable model resulting in predicted performance attributes. Example attributes are response time, throughput, and utilization of processors, memory, and communication lines. Special attention is paid to modeling both passive and active components exploiting synchronous method invocation and asynchronous message-passing interaction. We experimentally validated the framework on two case studies: an MPEG-4 decoder and a car navigation system. The prediction error on task latencies and processor usage was within 10%.
By scheduling multiple applications with complementary resource requirements on a smaller number of compute nodes, we aim to improve performance, resource utilization, energy consumption, and energy efficiency simultaneously. In addition to a naive consolidation approach, which already achieves these goals, we propose a new energy efficiency-aware (EEA) scheduling policy and compare it with current state-of-the-art policies, namely round-robin (RR), resource utilization-aware (RUA), and adaptive shortest-job first (ASJF). The EEA policy supports the consolidation of applications in heterogeneous computing systems and, in turn, simultaneously improves performance, resource utilization, energy consumption, and energy efficiency, as measured by the energy-delay product. Of particular note, our experimental results on a real heterogeneous computing system demonstrate the efficacy of our scheduling policies, improving overall energy efficiency by an order of magnitude.
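The energy-delay product used as the metric above is simply energy times runtime, and an EEA-style placement decision can be sketched as picking the node that minimizes predicted EDP. The node and job fields below are hypothetical illustrations, not the paper's interface.

```python
def energy_delay_product(energy_joules, runtime_seconds):
    """Lower is better: penalizes both slow and power-hungry placements."""
    return energy_joules * runtime_seconds

def eea_pick(job_work, nodes):
    """Choose the node with the lowest predicted EDP for this job."""
    def edp(node):
        runtime = job_work / node["speed"]      # t = work / speed
        energy = node["power_watts"] * runtime  # E = P * t
        return energy_delay_product(energy, runtime)
    return min(nodes, key=edp)

if __name__ == "__main__":
    nodes = [{"name": "gpu-node", "speed": 2.0, "power_watts": 100.0},
             {"name": "cpu-node", "speed": 1.0, "power_watts": 30.0}]
    # gpu-node: t=5, E=500, EDP=2500; cpu-node: t=10, E=300, EDP=3000.
    print(eea_pick(10.0, nodes)["name"])  # gpu-node
```

Note that pure energy minimization would pick the cpu-node here; the delay factor in EDP is what tips the choice toward the faster node.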
OP2 is a high-level domain specific library framework for the solution of unstructured mesh-based applications. It utilizes source-to-source translation and compilation so that a single application code written using the OP2 API can be transformed into multiple parallel implementations for execution on a range of back-end hardware platforms. In this paper we present the design and performance of OP2’s recent developments facilitating code generation and execution on distributed memory heterogeneous systems. OP2 targets the solution of numerical problems based on static unstructured meshes. We discuss the main design issues in parallelizing this class of applications. These include handling data dependencies in accessing indirectly referenced data and design considerations in generating code for execution on a cluster of multi-threaded CPUs and GPUs. Two representative CFD applications, written using the OP2 framework, are utilized to provide a contrasting benchmarking and performance analysis study on a number of heterogeneous systems including a large scale Cray XE6 system and a large GPU cluster. A range of performance metrics are benchmarked including runtime, scalability, achieved compute and bandwidth performance, runtime bottlenecks and systems energy consumption. We demonstrate that an application written once at a high-level using OP2 is easily portable across a wide range of contrasting platforms and is capable of achieving near-optimal performance without the intervention of the domain application programmer.
Self-powered vehicles that interact with the physical world, such as spacecraft, require computing platforms with predictable timing behavior and a low energy demand. Energy consumption can be reduced by choosing energy-efficient designs for both hardware and software components of the platform. We leverage the state-of-the-art in energy-efficient hardware design by adopting Heterogeneous Multi-core Processors with support for Dynamic Voltage and Frequency Scaling and Dynamic Power Management. We address the problem of allocating real-time software components onto heterogeneous cores such that total energy is minimized. Our approach is to start from an analytically justified target load distribution and find a task assignment heuristic that approximates it. Our analysis shows that neither balancing the load nor assigning all load to the "cheapest" core is the best load distribution strategy, unless the cores are extremely alike or extremely different. The optimal load distribution is then formulated as a solution to a convex optimization problem. A heuristic that approximates this load distribution and an alternative method that leverages the solution explicitly are proposed as viable task assignment methods. The proposed methods are compared to state-of-the-art on simulated problem instances and in a case study of a soft-real-time application on an off-the-shelf ARM big.LITTLE heterogeneous processor.
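The abstract's claim, that neither balancing the load nor piling everything on the cheapest core is optimal, can be reproduced with a toy convex energy model: cubic dynamic energy plus a static cost while a core is active. All coefficients here are hypothetical, and the grid search stands in for the paper's convex optimization.

```python
def core_energy(load, dyn, static):
    """Cubic dynamic energy plus a static cost while the core is active."""
    return dyn * load ** 3 + (static if load > 0 else 0.0)

def best_split(total_load, big, little, steps=1000):
    """Grid-search the big core's share of the load for minimal total energy."""
    shares = (i / steps * total_load for i in range(steps + 1))
    return min(shares, key=lambda x: core_energy(x, *big)
                                   + core_energy(total_load - x, *little))

if __name__ == "__main__":
    # (dyn, static) coefficients for a big core and a LITTLE core.
    split = best_split(1.0, big=(1.0, 0.1), little=(4.0, 0.05))
    print(round(split, 3))  # 0.667: neither 0.5 (balanced) nor 1.0 (one core)
```

The interior optimum (here 2/3 of the load on the big core) falls out of the convexity of the dynamic term; only when the cores' coefficients are nearly equal or wildly different does the optimum slide to a balanced or single-core assignment.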
Multiprocessor system-on-chip (MPSoC) designs offer a lot of computational power assembled in a compact design. The computing power of MPSoCs can be further augmented by adding massively parallel processor arrays (MPPA) and specialized hardware with instruction-set extensions. On-chip MPPAs can be used to accelerate low-level image-processing algorithms with massive inherent parallelism. However, the presence of multiple processing elements (PEs) with different characteristics raises issues related to programming and application mapping, among others. The conventional approach used for programming heterogeneous MPSoCs results in a static mapping of various parts of the application to different PE types, based on the nature of the algorithm and the structure of the PEs. Yet, such a mapping scheme independent of the instantaneous load on the PEs may lead to under-utilization of some type of PEs while overloading others. In this work, we investigate the benefits of using a heterogeneous MPSoC for accelerating various stages within a real-world image-processing algorithm for object-recognition. A case study demonstrates that a resource-aware programming model called Invasive Computing helps to improve the throughput and worst observed latency of the application program, by dynamically mapping applications to different types of PEs available on a heterogeneous MPSoC.
In this paper, we present a multi-paradigm, multi-grain parallel component model that extends the Common Component Architecture (CCA). Components have two kinds of paradigms: running paradigms and programming paradigms. Running paradigms can be serial execution, message-passing parallelism, or shared-memory parallelism; programming paradigms correspond to the programming languages the components use. The grain of a component can be coarse, medium, or fine. We built a resource management system to manage our heterogeneous platforms and propose a component scheduling policy based on the paradigm and grain descriptions of the components as well as on resource information. This policy improves the performance of CCA parallel component applications and raises the utilization of heterogeneous platforms.
Future spaceborne platforms will require expanded onboard processing payloads to meet increasing mission performance and autonomy requirements. Recently proposed spacecraft systems plan to deploy networked processors configured much like commodity clusters for high-performance computing (HPC). Just as robust job management services have been developed and are required to optimize the performance of ground-based systems, so too will spaceborne clusters require similar management services, especially to meet real-time mission deadlines. In order to gain insight into how best to address the challenge of job management in high-performance, embedded space systems, a management service has been developed for a NASA New Millennium Program (NMP) experiment for the ST-8 mission slated for launch in 2009. This paper presents an overview and analysis of the effects on overall mission performance of adding priority and preemption to a baseline gang scheduler employing opportunistic load balancing (OLB) on a heterogeneous processing system for space. Experiments are conducted with two mission scenarios, planetary mapping and object tracking.
Modern multiprocessor systems-on-chip (MPSoCs) are expected to handle multi-application use cases. As the number and complexity of these applications scale, resource allocation to meet the application throughput requirement is becoming quite a challenge. In this paper, a complete design flow is proposed for partially reconfigurable heterogeneous MPSoC platforms. The proposed flow determines the minimum resources required to map and guarantee the throughput of applications in all use-cases. Further, a suitable mapping for each application is chosen so that energy consumption is minimized. Experiments conducted with a set of synthetic benchmarks and real-life applications clearly demonstrate the advantage of our approach over homogeneous or fully reconfigurable designs. The proposed design flow achieves more than 50% energy savings when the number of configurations is not optimized. With configuration-optimization, our flow results in 75% reduction in the number of configurations with 5% reduction in energy.
In this paper, we present a domain-specific language, referred to as OptiSDR, that matches high-level digital signal processing (DSP) routines for software-defined radio (SDR) to generic parallel executable patterns targeted at heterogeneous computing architectures (HCAs). These HCAs include combinations of hybrid GPU-CPU and DSP-FPGA architectures that are programmed using different paradigms such as C/C++, CUDA, OpenCL, and/or VHDL. OptiSDR provides an intuitive, single high-level source code at a near-specification level for the optimization and facilitation of HCAs, building on an optimized embedded domain-specific language (DSL) compiler framework called Delite. Our focus is on language expressiveness for parallel programming and on the optimization of typical DSP algorithms for deployment on SDR HCAs. We demonstrate the capability of OptiSDR to express solutions to low-level parallel DSP implementation complexities in a form close to the original parallel formulation of SDR systems, focusing on three generic parallel executable patterns suitable for DSP routines: cross-correlation, convolution in FIR-filter-based Hilbert transformers, and fast Fourier transforms for spectral analysis. The paper concludes with a performance analysis using DSP algorithms that tests automatically generated code against hand-crafted solutions.
Given a collection of documents residing on a disk, we develop a new strategy for processing these documents and building the inverted files extremely quickly. Our approach is tailored for a heterogeneous platform consisting of multicore CPUs and highly multithreaded GPUs. Our algorithm is based on a number of novel techniques, including a high-throughput pipelined strategy, a hybrid trie and B-tree dictionary data structure, dynamic work allocation to CPU and GPU threads, and an optimized CUDA indexer implementation. We have performed extensive tests of our algorithm on a single node (two Intel Xeon X5560 Quad-core CPUs) with two NVIDIA Tesla C1060 GPUs attached to it, and were able to achieve a throughput of more than 262 MB/s on the ClueWeb09 dataset. Similar results were obtained for widely different datasets. The throughput of our algorithm is superior to the best known algorithms reported in the literature, even when compared to those run on large clusters.
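For reference, the core data structure being built, stripped of the pipelining, the trie/B-tree dictionary, and the CPU/GPU work allocation that make the paper's throughput possible, is just a term-to-postings map:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document ids containing it."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

if __name__ == "__main__":
    index = build_inverted_index(["the cat sat", "the dog ran"])
    print(index["the"])  # [0, 1]
    print(index["cat"])  # [0]
```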
Hybrid computing systems consisting of multiple platform types (e.g., general purpose processors, FPGAs etc.) are increasingly being used to achieve higher performance and lower costs than can be obtained with homogeneous systems (e.g., processor clusters). Different platforms have different languages and simulators associated with them. Auto-Pipe has been developed as a toolset to reduce the complexity inherent in deploying an application to a diverse resource set. In Auto-Pipe, applications are expressed using the data flow coordination language X, which describes the application in terms of interactions between functional blocks. As part of the Auto-Pipe system, X-sim has been developed as a federated distributed simulator that can be used to conveniently and efficiently simulate applications. After a short introduction to Auto-Pipe and the X language, this paper considers issues involved with total system simulation of an application mapped to a hybrid resource set. The paper then demonstrates the use of X-sim with a real-time signal processing application employed in the VERITAS gamma-ray astronomy project.
Heterogeneous multi-target platforms composed of processors, FPGAs, and specialized I/O are popular targets for embedded applications. Model based design approaches are increasingly used to deploy high performance concurrent applications on these platforms. In addition to programmability and performance, embedded systems need to ensure reliability and availability in safety-critical environments. However, prior design approaches do not sufficiently characterize these non-functional requirements in the application or in the mapping on the multi-target platform. In this work, we present a design methodology and associated run-time environment for programmable heterogeneous multi-target platforms that enable design of reliable systems by: (a) elevating reliability concerns to the system modeling level, so that a domain expert can capture reliability requirements within a formal model of computation, (b) modeling platform elements that can be automatically composed into systems to provide a reliable architecture for deployment, and (c) segmenting (in space and time) the run-time environment such that the system captures independent end-user provided reliability criteria. We illustrate the modeling, analysis, and implementation capabilities of our methodology to design fault tolerant control applications. Using the National Instruments PXIe platform and FlexRIO components, we demonstrate a runtime environment that provides desired levels of reliability.
Recent advances in graphics processing units (GPUs) technology open a new era in high performance computing. Applications of GPUs to scientific computations are attracting a lot of attention due to their low cost in conjunction with their inherently remarkable performance features and the recently enhanced computational precision and improved programming tools. Domain decomposition methods (DDM) constitute today an important category of methods for the solution of highly demanding problems in simulation-based applied science and engineering. Among them, dual domain decomposition methods have been successfully applied in a variety of problems in both sequential as well as in parallel/distributed processing systems. In this work, we demonstrate the implementation of the FETI method to a hybrid CPU–GPU computing environment. Parametric tests on implicit finite element structural mechanics benchmark problems revealed the tremendous potential of this type of hybrid computing environment as a result of the full exploitation of multi-core CPU hardware resources and the intrinsic software and hardware features of the GPUs as well as the numerical properties of the solution method.
In this paper we present the results of designing and implementing a simulator for a hybrid system (named H-system), to investigate the effects on system load and throughput of a heterogeneous system compared to a conventional one. Recent advances in Graphics Processing Units (GPUs) and the introduction of a standard language such as OpenCL, which allows applications to be executed on both multicore (CPU) and manycore (GPU) architectures, have made it possible to use such devices in cooperation with the CPU, executing jobs on both GPUs and CPUs and increasing overall H-system performance. In the present scenario, however, scheduling in an H-system is not well addressed, and this type of investigation is very important: the efficient scheduling of OpenCL jobs in conjunction with ordinary jobs may dramatically change the future scenario of computing. Our work shows clearly that adopting a proper hardware and software configuration of the H-system (in particular the GPU/CPU ratio) increases computing performance in terms of mean response time and workload balance.
Modern processors have the potential of executing compute-intensive programs quickly and efficiently, but require applications to be adapted to their ever increasing parallelism. Here, heterogeneous systems add complexity by combining processing units with different characteristics. Scheduling should thus consider the performance of each processor as well as competing workloads and varying inputs. To assist programmers of stream processing applications in facing this challenge we present libHawaii, an open source library for cooperatively using all processors of heterogeneous systems easily and efficiently. It supports exploiting data flow, data element and task parallelism via pipelining, partitioning and demand-based allocation of consecutive work items. Scheduling is automatically adapted on-line to continuously optimize performance and energy efficiency. Our C++ library does not depend on specific hardware architectures or parallel computing frameworks. However, it facilitates maximizing the throughput of compatible GPUs by overlapping computations and memory transfers while maintaining low latencies. This paper describes the algorithms and implementation of libHawaii and demonstrates its usage on existing applications. We experimentally evaluate our library using two examples: General matrix multiplication (GEMM) is a simple yet important building block of many high-performance computing applications. Complementarily, the detection, extraction and matching of sparse image features exhibits greater complexity, including indeterministic memory access and synchronization.
For microdosimetric calculations event-by-event Monte Carlo (MC) methods are considered the most accurate. The main shortcoming of those methods is the extensive requirement for computational time. In this work we present an event-by-event MC code of low projectile energy electron and proton tracks for accelerated microdosimetric MC simulations on a graphic processing unit (GPU). Additionally, a hybrid implementation scheme was realized by employing OpenMP and CUDA in such a way that both GPU and multi-core CPU were utilized simultaneously. The two implementation schemes have been tested and compared with the sequential single threaded MC code on the CPU. Performance comparison was established on the speed-up for a set of benchmarking cases of electron and proton tracks. A maximum speedup of 67.2 was achieved for the GPU-based MC code, while a further improvement of the speedup up to 20% was achieved for the hybrid approach. The results indicate the capability of our CPU–GPU implementation for accelerated MC microdosimetric calculations of both electron and proton tracks without loss of accuracy.
DynEarthSol3D (Dynamic Earth Solver in Three Dimensions) is a flexible, open-source finite element solver that models the momentum balance and the heat transfer of elasto-visco-plastic material in the Lagrangian form using unstructured meshes. It provides a platform for the study of the long-term deformation of Earth's lithosphere and various problems in civil and geotechnical engineering. However, the continuous computation and update of a very large mesh poses an intolerably high computational burden to developers and users in practice. For example, simulating a small input mesh containing around 3000 elements for 20 million time steps would take more than 10 days on a high-end desktop CPU. In this paper, we explore tightly coupled CPU–GPU heterogeneous processors to address this computing concern by leveraging their new features and developing hardware-architecture-aware optimizations. Our key optimization techniques are three-fold: memory access pattern improvement, data transfer elimination, and kernel launch overhead minimization. Experimental results show that our implementation on a tightly coupled heterogeneous processor outperforms all other alternatives, including a traditional discrete GPU, a quad-core CPU using OpenMP, and a serial implementation, by 67%, 50%, and 154% respectively, even though the embedded GPU in the heterogeneous processor has significantly fewer cores than the high-end discrete GPU.
Spectral calculation and analysis have very important practical applications in astrophysics. The main portion of spectral calculation is solving a large number of one-dimensional numerical integrations at each point of a large three-dimensional parameter space. However, existing widely used solutions remain at process-level parallelism, which is not adequate for numerous compute-intensive small integral tasks. This paper presents a GPU-optimized approach to accelerate the numerical integration in massive spectral calculation. We also propose a load balance strategy for a hybrid multi-CPU and multi-GPU architecture via shared memory to maximize performance. The approach was prototyped and tested on the Astrophysical Plasma Emission Code (APEC), a commonly used spectral toolset. Compared with the original serial version and a parallel version on 24 CPU cores (2.5 GHz), our implementation on 3 Tesla C2075 GPUs achieves speed-ups of up to 300× and 22×, respectively.
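A CPU/GPU load balance of this kind can be illustrated with a deterministic event-driven sketch of the shared-queue idea (device speeds below are hypothetical): whichever device becomes idle first grabs the next integral task, so faster devices naturally drain more of the queue.

```python
import heapq

def dynamic_balance(num_tasks, task_cost, device_speeds):
    """The next idle device pulls the next task from the shared queue."""
    done = {name: 0 for name in device_speeds}
    idle = [(0.0, name) for name in sorted(device_speeds)]  # (free_at, name)
    heapq.heapify(idle)
    for _ in range(num_tasks):
        free_at, name = heapq.heappop(idle)     # earliest-idle device
        done[name] += 1
        heapq.heappush(idle, (free_at + task_cost / device_speeds[name], name))
    makespan = max(free_at for free_at, _ in idle)
    return done, makespan

if __name__ == "__main__":
    done, makespan = dynamic_balance(30, 1.0, {"cpu": 1.0, "gpu": 4.0})
    print(done)      # {'cpu': 6, 'gpu': 24}
    print(makespan)  # 6.0
```

With a 4x faster GPU, the queue ends up split 24/6 and both devices finish together, which is exactly the behavior a static even split would miss.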
Massively parallel architectures are mainly based on a heterogeneous setup: they are composed of different computing devices that speed up specific code regions, named kernels, which are offloaded to the corresponding devices. Porting applications to a specific heterogeneous platform is a costly task in terms of human resources and time-to-market, and one of the key steps in the porting process is manually analyzing and detecting the kernels in applications. Moreover, each device of these heterogeneous platforms has its own restrictions, such as memory allocation support, so kernels must be mapped to suitable computing devices. Finally, memory transfer operations become an important bottleneck in heterogeneous platforms; to improve performance, transfer operations have to be avoided whenever possible. In this paper, we introduce AKI, an automatic kernel identification and annotation tool that aims to identify potential kernels in C++ sequential applications. AKI looks for hotspots that can be offloaded to heterogeneous computing devices and annotates them as kernels using REPARA C++ attributes, mapping the kernels to suitable heterogeneous computing devices. This process analyzes not only the kernels themselves but also their associated data, including kernel and data interdependencies. These annotations can aid future automatic source-to-source transformation tools for heterogeneous platforms.
The parallel preconditioned conjugate gradient method (CGM) is used in many applications of scientific computing and often has a critical impact on their performance and energy consumption. This article investigates the energy-aware execution of the CGM on multi-core CPUs and GPUs used in an adaptive FEM. Based on experiments, an application-specific execution time and energy model is developed. The model considers the execution speed of the CPU and the GPU, their electrical power, voltage and frequency scaling, the energy consumption of the memory as well as the time and energy needed for transferring the data between main memory and GPU memory. The model makes it possible to predict how to distribute the data to the processing units for achieving the most energy efficient execution: the execution might deploy the CPU only, the GPU only or both simultaneously using a dynamic and adaptive collaboration scheme. The dynamic collaboration enables an execution minimising the execution time. By measuring execution times for every FEM iteration, the data distribution is adapted automatically to changing properties, e.g. the data sizes.
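A stripped-down version of such a model (all coefficients hypothetical, and linear where the paper's measured model need not be) already shows how the predicted optimum can flip between CPU-only, GPU-only, and hybrid execution depending on the devices' speed, power, and transfer costs:

```python
def device_cost(n_elems, dev):
    """Time = transfer + work/speed; energy = power * time + memory energy."""
    time = dev["transfer_per_elem"] * n_elems + n_elems / dev["speed"]
    energy = dev["power"] * time + dev["mem_energy_per_elem"] * n_elems
    return time, energy

def best_cpu_share(n_elems, cpu, gpu, steps=100):
    """Number of elements to place on the CPU for minimal total energy."""
    def total_energy(k):
        return device_cost(k, cpu)[1] + device_cost(n_elems - k, gpu)[1]
    return min((round(n_elems * i / steps) for i in range(steps + 1)),
               key=total_energy)

if __name__ == "__main__":
    gpu = {"transfer_per_elem": 1e-3, "speed": 10.0, "power": 100.0,
           "mem_energy_per_elem": 0.0}
    cheap_cpu = {"transfer_per_elem": 0.0, "speed": 1.0, "power": 1.0,
                 "mem_energy_per_elem": 0.0}
    hot_cpu = dict(cheap_cpu, power=50.0)
    print(best_cpu_share(1000, hot_cpu, gpu))    # 0    -> GPU only
    print(best_cpu_share(1000, cheap_cpu, gpu))  # 1000 -> CPU only
```

With linear costs the optimum lands on a boundary; the per-iteration measurements described in the abstract are what let the real model track nonlinear effects and favor a genuine split.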
The recent development of heterogeneous platforms (i.e., those containing different types of computational units, such as multicore CPUs, GPUs, and FPGAs) has enabled significant performance improvements for real-time data processing. This potential, however, is still not fully utilized due to the lack of methods for optimal software configuration; the allocation of software components to the different computational unit types is crucial for maximal utilization of the platform, but for more complex systems it is difficult to find, ad hoc, a good enough or optimal configuration. In this paper, subject to system- and user-defined constraints, we apply the analytic hierarchy process and a genetic algorithm to find a feasible, locally optimal allocation of software components to computational units.
Numerous problems in science and engineering involve discretizing the problem domain as a regular structured grid and make use of domain decomposition techniques to obtain solutions faster using high performance computing. However, load imbalance among the various processing nodes can cause severe degradation in application performance. This problem is exacerbated when the computational workload is non-uniform and the processing nodes have varying computational capabilities. In this paper, we present novel local search algorithms for regular partitioning of a structured mesh to heterogeneous compute nodes in a distributed setting. The algorithms seek to assign larger workloads to processing nodes having higher computational capabilities while maintaining the regular structure of the mesh in order to achieve a better load balance. We also propose a distributed memory (MPI) parallelization architecture that can be used to achieve a parallel implementation of scientific modelling software requiring structured grids on heterogeneous processing resources involving CPUs and GPUs. Our implementation can make use of the available CPU cores and multiple GPUs of the underlying platform simultaneously. Empirical evaluation on real world flood modelling domains on a heterogeneous architecture comprising multicore CPUs and GPUs suggests that the proposed partitioning approach can provide a performance improvement of up to 8× over a naive uniform partitioning.
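A minimal version of the idea, 1-D row partitioning with a hypothetical uniform per-row cost, seeds a proportional split and then migrates boundary rows while the makespan keeps improving (the paper's algorithms handle regular 2-D partitions and non-uniform workloads):

```python
def makespan(rows, speeds):
    """Finish time with unit cost per row: the slowest node dominates."""
    return max(r / s for r, s in zip(rows, speeds))

def proportional_rows(num_rows, speeds):
    """Seed: row counts proportional to node speed, leftovers to node 0."""
    total = sum(speeds)
    rows = [int(num_rows * s / total) for s in speeds]
    rows[0] += num_rows - sum(rows)
    return rows

def local_search(rows, speeds):
    """Shift one row from the latest finisher to the earliest while it helps."""
    rows = rows[:]
    while True:
        worst = max(range(len(rows)), key=lambda i: rows[i] / speeds[i])
        best = min(range(len(rows)), key=lambda i: rows[i] / speeds[i])
        trial = rows[:]
        trial[worst] -= 1
        trial[best] += 1
        if worst == best or makespan(trial, speeds) >= makespan(rows, speeds):
            return rows
        rows = trial

if __name__ == "__main__":
    speeds = [1.0, 2.0]                    # e.g. CPU node vs. GPU node
    rows = proportional_rows(10, speeds)   # [4, 6] after rounding
    print(rows, makespan(rows, speeds))    # [4, 6] 4.0
    tuned = local_search(rows, speeds)
    print(tuned, makespan(tuned, speeds))  # [3, 7] 3.5
```

Rounding leaves the seed slightly imbalanced; the local search repairs it by moving one boundary row to the faster node.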
Emerging integrated CPU + FPGA hybrid platforms, such as the Extensible Processing Platform architecture from Xilinx [1], offer an unprecedented opportunity to achieve both multifunctionality and real-time responsiveness for memory-intensive embedded applications. However, how to cost-effectively synthesize application-specific hardware constructs that fully exploit memory-level parallelism remains a key challenge. To address this problem, we propose a new FPGA-based embedded computer architecture, ASTRO (Application-Specific Hardware Traces with Reconfigurable Optimization). Our main contribution is the development of an integrated methodology that focuses on how to construct an application-specific memory access network capable of extracting the maximum amount of memory-level parallelism on a per-application basis. In particular, our proposed ASTRO architecture can (1) perform dynamic memory analysis to maximally extract the target application’s instruction, loop and memory-level parallelism for performance enhancement, (2) synthesize highly efficient accelerators that enable parallelized memory accesses, and therefore (3) accomplish effective data orchestration by utilizing the capabilities of modern FPGA devices: abundant distributed block RAMs and reprogrammability. To empirically validate our ASTRO methodology, we have implemented a baseline embedded processor platform, a conventional CPU + accelerator with a centralized single memory, and a prototype ASTRO machine based on Xilinx MicroBlaze technology. Our experimental results show that on average for 10 benchmark applications from SPEC2006 and MiBench [2], the ASTRO machine achieves 8.6 times speedup compared to the baseline embedded processor platform and 1.7 times speedup compared to a conventional CPU + accelerator platform. More interestingly, the ASTRO platform achieves more than 40% reduction in energy-delay product compared to a conventional CPU + accelerator with a centralized memory.
This work analyses two techniques for auto-tuning linear algebra routines for hybrid combinations of multicore CPUs and manycore coprocessors (single or multiple GPUs and MIC). The first technique is based on basic models of the execution time of the routines, whereas the second one manages only empirical information obtained during the installation of the routines. The final goal in both cases is to obtain a balanced assignment of the work to the computing components in the system. The study is carried out with a basic kernel (matrix-matrix multiplication) and a higher-level routine (LU factorization) which uses the auto-tuned basic routine. Satisfactory results are obtained, with experimental execution times close to the lowest experimentally achievable.
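A minimal sketch of the model-based balancing idea (the function and rate parameters are illustrative, not from the paper): if the CPU and GPU are modelled as processing rows at rates r_cpu and r_gpu, the split that equalizes the predicted execution times n_cpu/r_cpu = n_gpu/r_gpu assigns work proportionally to the rates.

```python
def balanced_split(n_rows, cpu_rate, gpu_rate):
    """Split n_rows between CPU and GPU so that the modelled times
    n_cpu / cpu_rate and n_gpu / gpu_rate are (nearly) equal.
    Rates are rows processed per unit time, e.g. from installation runs."""
    n_gpu = round(n_rows * gpu_rate / (cpu_rate + gpu_rate))
    return n_rows - n_gpu, n_gpu
```

For example, with a GPU measured to be four times faster than the CPU, 1000 rows split as (200, 800), giving equal predicted times on both devices; the empirical variant described in the abstract would instead pick the split directly from timings recorded at installation.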
Using multiple accelerators, such as GPUs or Xeon Phis, is attractive for improving the performance of large data-parallel applications and for increasing the size of their workloads. However, writing an application for multiple accelerators remains challenging today, because going from a single accelerator to multiple ones requires dealing with potentially non-uniform domain decomposition, inter-accelerator data movements, and dynamic load balancing. Writing such code manually is time-consuming and error-prone. In this paper, we propose a new programming tool called STEPOCL, along with a new domain-specific language designed to simplify the development of an application for multiple accelerators. We evaluate both the performance and the usefulness of STEPOCL with three applications and show that: (i) the performance of an application written with STEPOCL scales linearly with the number of accelerators, (ii) the performance of an application written using STEPOCL competes with a handwritten version, (iii) larger workloads run on multiple devices that do not fit in the memory of a single device, and (iv) thanks to STEPOCL, the number of lines of code required to write an application for multiple accelerators is roughly divided by ten.
The proposed approach presents a method for automatically synthesizing the SW code of complex embedded systems from a model-driven system specification. The solution aims to enable easy exploration and design of different allocations of SW components in heterogeneous platforms, minimizing designer effort. The system is initially described following the UML/MARTE standard. Applying this standard, the system is modeled by describing its components, interfaces and communication links, the system memory spaces, the resource allocations and the HW architecture. From that information, a SW infrastructure containing the communication infrastructure is generated ad hoc for the system, depending on the HW architecture and the resource allocations evaluated. The resulting reduction in communication overhead can be an important advantage for system performance optimization.
HOSTA is an in-house high-order CFD software package that can simulate complex flows with complex geometries. Large-scale high-order CFD simulations using HOSTA require massive HPC resources, motivating us to port it onto modern GPU-accelerated supercomputers like Tianhe-1A. To achieve a greater speedup and fully tap the potential of Tianhe-1A, we use the CPU and GPU collaboratively for HOSTA instead of a naive GPU-only approach. We present multiple novel techniques to balance the load between the memory-constrained GPU and the memory-rich CPU, and to overlap the collaborative computation and communication as far as possible. Taking CPU and GPU load balance into account, we improve the maximum simulation problem size per Tianhe-1A node for HOSTA by 2.3X; meanwhile, the collaborative approach improves performance by around 45% compared to the GPU-only approach. Scalability tests show that HOSTA can achieve a parallel efficiency of above 60% on 1024 Tianhe-1A nodes. With our method, we have successfully simulated China's large civil airplane configuration C919, containing 150M grid cells. To the best of our knowledge, this is the first paper to report a CPU-GPU collaborative high-order accurate aerodynamic simulation result with such a complex grid geometry.
This paper proposes and evaluates a strategy to run Biological Sequence Comparison applications on hybrid platforms composed of GPUs and multicores with SIMD extensions. Our strategy provides multiple task allocation policies, and the user can choose the one that is most appropriate to their problem. We also propose a workload adjustment mechanism that tackles situations that arise when slow nodes receive the last tasks. The results obtained by comparing query sequences to 5 public genomic databases on a platform composed of 4 GPUs and 2 multicores show that we are able to reduce the execution time with hybrid platforms, when compared to the GPU-only solution. We also show that our workload adjustment technique can provide significant performance gains on our target platforms.
We consider the problem of allocating and scheduling dense linear algebra applications on fully heterogeneous platforms made of CPUs and GPUs. More specifically, we focus on the Cholesky factorization, since it exhibits the main features of such problems. Indeed, the relative performance of CPU and GPU highly depends on the sub-routine: GPUs are, for instance, much more efficient at processing regular kernels such as matrix-matrix multiplications than more irregular kernels such as matrix factorizations. In this context, one solution consists in relying on dynamic scheduling and resource allocation mechanisms such as the ones provided by PaRSEC or StarPU. In this paper we analyze the performance of dynamic schedulers based on both actual executions and simulations, and we investigate how adding static rules, based on an offline analysis of the problem, to their decision process can improve their performance, up to reaching improved theoretical performance bounds which we introduce.
In this paper, we revisit the design and implementation of Branch-and-Bound (B&B) algorithms for solving large combinatorial optimization problems on GPU-enhanced multi-core machines. B&B is a tree-based optimization method that uses four operators (selection, branching, bounding and pruning) to build and explore a highly irregular tree representing the solution space. In our previous works, we have proposed a GPU-accelerated approach in which only a single CPU core is used and only the bounding operator is performed on the GPU device. Here, we extend the approach (LL-GB&B) in order to minimize the CPU-GPU communication latency and thread divergence. Such an objective is achieved through a GPU-based fine-grained parallelization of the branching and pruning operators in addition to the bounding one. The second contribution consists in investigating the combination of a GPU with multi-core processing. Two scenarios have been explored, leading to two approaches: a concurrent one (RLL-GB&B) and a cooperative one (PLL-GB&B). In the first one, the exploration process is performed concurrently by the GPU and the CPU cores. In the cooperative approach, the CPU cores prepare and offload to the GPU pools of tree nodes using data streaming while the GPU performs the exploration. The different approaches have been extensively experimented on the Flowshop scheduling problem. Compared to a single CPU-based execution, LL-GB&B allows accelerations of up to (×160) for large problem instances. Moreover, when combining multi-core and GPU, we find that using RLL-GB&B is not beneficial, while PLL-GB&B enables an improvement of up to 36% compared to LL-GB&B.
In the world of mobile and embedded devices, most of which are battery powered, optimizing computations for low energy is becoming increasingly important. One approach to diminished energy consumption is the use of dedicated hardware logic (rather than general-purpose processors) to execute some portion of the application load. Due to the diversity of applications that one may run on the same device, field-programmable gate arrays (FPGAs) are an attractive target since they can readily be reconfigured to implement different functions and are known to provide significant energy savings in certain domains. Unfortunately, FPGAs are difficult to program, typically requiring expertise in hardware description languages. Here we analyze the potential energy benefits from offloading computations to an FPGA device when starting from a high-level language expression of an application in ScalaPipe [1], which is a domain-specific language embedded in the Scala programming language [2] for creating streaming applications on heterogeneous systems consisting of general-purpose processors and FPGAs. We explore the effect of several synthesis optimizations on improving energy usage without sacrificing application performance, concluding that it is possible to reduce energy consumption significantly for computations even when expressed in a high-level language. Here we investigate total energy consumption, which is a combination of the power use and application run time. All of the optimizations considered improve performance, but some also increase power use, which can be a net loss in energy depending on the application.
In a quest to improve system performance, embedded systems today increasingly rely on heterogeneous platforms that combine different types of processing units such as CPUs, GPUs and FPGAs. However, having better hardware capability alone does not guarantee higher performance; how functionality is allocated onto the appropriate processing units strongly impacts system performance as well. Yet, with this increase in hardware complexity, finding suitable allocation schemes is becoming a challenge, as many new constraints and requirements must now be taken into consideration. In this paper, we present a formal model for allocation optimization of embedded systems which contain a mix of CPU and GPU processing nodes. The allocation takes into consideration the software and hardware architectures, the system requirements, and the criteria upon which the allocation should be optimized. In its current version, optimized allocation schemes are generated through an integer programming technique to balance system resource utilization and to optimize system performance using the GPU resources.
Given a collection of recurring tasks or processes that comprise the software for an embedded system, and a number of different types of available processing units, the minimum cost synthesis problem is concerned with obtaining an implementation of the embedded system upon a multiprocessor platform comprised of processing units from among the available types, such that the total cost of the platform is minimized. It is shown that this problem is intractable (NP-hard in the strong sense). Approximation algorithms are presented that guarantee to obtain implementations with cost no more than a constant amount greater than twice the cost of an optimal implementation.
On modern GPU clusters, the role of the CPUs is often restricted to controlling the GPUs and handling MPI communication. The unused computing power of the CPUs, however, can be considerable for computations whose performance is bounded by memory traffic. This paper investigates the challenges of simultaneous usage of CPUs and GPUs for computation. Our emphasis is on deriving a heterogeneous CPU+GPU programming approach that combines MPI, OpenMP and CUDA. To effectively hide the overhead of various inter- and intra-node communications, a new level of task parallelism is introduced on top of the conventional data parallelism. Combined with a suitable workload division between the CPUs and GPUs, our CPU+GPU programming approach is able to fully utilize the different processing units. The programming details and achievable performance are exemplified by a widely used 3D 7-point stencil computation, which shows high performance and scaling in experiments using up to 64 CPU-GPU nodes.
Significant application performance improvements can be achieved by heterogeneous compute technologies, such as multi-core CPUs, GPUs and FPGAs. The HARNESS project is developing architectural principles that enable next-generation cloud platforms to incorporate such devices, thereby vastly increasing performance, reducing energy consumption, and lowering associated cost profiles. Along with management and integration of such devices in a cloud environment, a key issue is enabling enterprise-level software to make effective use of such compute devices. A major obstacle in adopting heterogeneous compute resources is the requirement that at design time the developer must decide on which device to execute portions of the application. For an interactive application, such as SAP HANA, where there are many ongoing tasks and processes, this type of decision is impossible to predict at design time. What is required is the ability to decide, at run time, the optimal compute device on which to execute a task. This paper extends existing work on SHEPARD to support non-OpenCL devices. SHEPARD decouples application development from the target platform and enables the required run-time allocation of tasks to heterogeneous computing devices. This paper establishes SHEPARD's capability to: (1) select the appropriate compute device to execute tasks, (2) dynamically load the device application code at runtime, and (3) execute the application logic. Experiments demonstrate how SHEPARD optimises the execution of a SAP HANA database management function across heterogeneous compute devices and performs automatic run-time task allocation.
In this paper, we describe a runtime to automatically enhance the performance of applications running on heterogeneous platforms consisting of a multi-core (CPU) and a throughput-oriented many-core (GPU). The CPU and GPU are connected by a non-coherent interconnect such as PCI-E, and as such do not have shared memory. Heterogeneous platforms available today, such as [9], are of this type. Our goal is to enable the programmer to seamlessly use such a system without rewriting the application and with minimal knowledge of the underlying architectural details. Assuming that applications perform function calls to computational kernels with available CPU and GPU implementations, our runtime achieves this goal by automatically scheduling the kernels and managing data placement. In particular, it intercepts function calls to well-known computational kernels and schedules them on the CPU or GPU based on their argument size and location. To improve performance, it defers all data transfers between the CPU and the GPU until necessary. By managing data placement transparently to the programmer, it provides a unified memory view despite the underlying separate memory sub-systems. We experimentally evaluate our runtime on a heterogeneous platform consisting of a 2.5GHz quad-core Xeon CPU and an NVIDIA C870 GPU. Given array sorting, parallel reduction, dense and sparse matrix operations, and ranking as computational kernels, we use our runtime to automatically retarget SSI [25], K-means [32] and two synthetic applications to the above platform with no code changes. We find that, in most cases, performance improves if the computation is moved to the data, and not vice versa. For instance, even if a particular instance of a kernel is slower on the GPU than on the CPU, the overall application may be faster if the kernel is scheduled on the GPU anyway, especially if the kernel data is already located in GPU memory due to prior decisions. Our results show that data-aware CPU/GPU scheduling improves performance by up to 25% over the best data-agnostic scheduling on the same platform.
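The data-aware scheduling decision described above can be sketched roughly as follows (a hypothetical simplification, not the paper's actual runtime): prefer the device that already holds the larger share of the kernel's argument data, and fall back to the faster device only when neither side dominates.

```python
def choose_device(args, locations, cpu_time, gpu_time):
    """Data-aware device choice for one kernel invocation.
    `args` maps argument name -> size in bytes; `locations` maps
    argument name -> "cpu" or "gpu" (defaulting to "cpu").
    Transfer cost dominates for large arguments, so move the compute
    to the data; break ties with the estimated kernel times."""
    bytes_on = {"cpu": 0, "gpu": 0}
    for name, size in args.items():
        bytes_on[locations.get(name, "cpu")] += size
    if bytes_on["gpu"] != bytes_on["cpu"]:
        return "gpu" if bytes_on["gpu"] > bytes_on["cpu"] else "cpu"
    return "gpu" if gpu_time < cpu_time else "cpu"
```

This captures the abstract's key observation: a kernel may be sent to the GPU even when its GPU implementation is slower, simply because its inputs already reside in GPU memory from earlier decisions.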
Modern computer systems become increasingly distributed and heterogeneous by comprising multi-core CPUs, GPUs, and other accelerators. Current programming approaches for such systems usually require the application developer to use a combination of several programming models (e.g., MPI with OpenCL or CUDA) in order to exploit the system’s full performance potential. In this paper, we present dOpenCL (distributed OpenCL)—a uniform approach to programming distributed heterogeneous systems with accelerators. dOpenCL allows the user to run unmodified existing OpenCL applications in a heterogeneous distributed environment. We describe the challenges of implementing the OpenCL programming model for distributed systems, as well as its extension for running multiple applications concurrently. Using several example applications, we compare the performance of dOpenCL with MPI + OpenCL and standard OpenCL implementations.
The high computational demands and overall encoding complexity make real-time processing of high-definition video sequences hard to achieve. In this manuscript, we target an efficient parallelization and RD performance analysis of H.264/AVC inter-loop modules and their collaborative execution on hybrid multi-core CPU and multi-GPU systems. The proposed dynamic load balancing algorithm allows efficient and concurrent video encoding across several heterogeneous devices by relying on realistic run-time performance modeling and module-device execution affinities when distributing the computations. Due to an online adjustment of load balancing decisions, this approach is also self-adaptable to different execution scenarios. Experimental results show the proposed algorithm's ability to achieve real-time encoding for different resolutions of high-definition sequences on various heterogeneous platforms. Speed-up values of up to 2.6 were obtained when compared to video inter-loop encoding on a single GPU device, and up to 8.5 when compared to a highly optimized multi-core CPU execution. Moreover, the proposed algorithm also provides automatic tuning of the encoding parameters in order to meet strict encoding constraints.
Simulations of colliding galaxies or fluid dynamics at immersed flexible boundaries are most accurately and efficiently accomplished using the adaptive fast multipole method (AFMM) to solve an underlying n-body problem whose localized density varies with the time-dependent evolution of the system under study. Parallelization of the AFMM presents a challenging load balancing problem that must be addressed dynamically as the system evolves. We consider parallelization of the AFMM for time dependent problems using a heterogeneous shared memory compute node consisting of multi-core processors and GPU accelerators. OpenMP task parallelism is used within the CPU cores to parallelize the construction and maintenance of the adaptive spatial decomposition tree and its traversal to compute far-field interactions at each leaf node in the tree. Concurrently, GPUs evaluate all near-field interactions using all-pairs computations. In addition to accurately resolving many physical phenomena out of reach using the uniform FMM, the more complex AFMM permits the number of bodies in leaf cells to be globally and locally varied in order to minimize the CPU and GPU time. We present a cost model and incremental adjustment strategy to load balance the AFMM on a heterogeneous system. We demonstrate using these techniques that a simulation can maintain load balance over hundreds of time steps on a heterogeneous system with 10 CPU cores and 4 GPUs with less than 2% overhead, while achieving a 98X speedup over a serial computation using a single CPU core.
This work presents a hybrid computing approach which combines GPUs and multicore processors to fully take advantage of the computing power latent in modern computers. It also presents its application to the problem of tomographic reconstruction. One inherent characteristic of these modern platforms is their heterogeneity, which raises the issue of workload distribution among the different processing elements. Adaptive load balancing techniques are thus necessary to properly adjust the amount of work to be done by each computing element. Here, we have chosen the 'on-demand' strategy, a well-known technique in the HPC field by which the different elements asynchronously request a piece of work when they become idle, thereby keeping the system fairly well balanced. The results show that our scheme adapts to the heterogeneous platform where it runs, as it automatically assigns more work to the faster processing elements, which allows it to correctly exploit all the available resources and to obtain complete reconstructions in less time than pure CPU or GPU approaches.
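The 'on-demand' strategy can be sketched in a few lines (worker names and the chunk-processing functions are illustrative, not from the paper): idle workers pull the next piece of work from a shared queue, so faster devices naturally end up processing more chunks without any explicit speed model.

```python
import queue
import threading

def on_demand(chunks, workers):
    """'On-demand' load balancing: each worker (e.g. a CPU core or a
    GPU host thread) pulls the next chunk as soon as it becomes idle.
    `workers` is a list of (name, process_fn) pairs; returns the
    results collected by each worker."""
    work = queue.Queue()
    for c in chunks:
        work.put(c)
    done = {name: [] for name, _ in workers}

    def run(name, process):
        while True:
            try:
                c = work.get_nowait()   # ask for work; stop when none left
            except queue.Empty:
                return
            done[name].append(process(c))

    threads = [threading.Thread(target=run, args=w) for w in workers]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done
```

Which worker gets which chunk is nondeterministic by design; only the invariant that every chunk is processed exactly once is guaranteed.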
Many modern applications require high-performance platforms to deal with a variety of algorithms requiring massive calculations. Moreover, low-cost powerful hardware (e.g., GPU, PPU) and CPUs with multiple cores have become abundant, and can be combined in heterogeneous architectures. To cope with this, reconfigurable computing is a potential paradigm as it can provide flexibility to explore the computational resources on hybrid and multi-core desktop architectures. The workload can optimally be (re)distributed over heterogeneous cores along the lifecycle of an application, aiming for best performance. As the first step towards a run-time reconfigurable load-balancing framework, application requirements and crosscutting concerns related to timing play an important role for task allocation decisions. In this paper, we present the use of aspect-oriented paradigms to address non-functional application timing constraints in the design phase. The DERAF aspects' framework is extended to support reconfiguration requirements, and a strategy for load balancing is described. In addition, we present a preliminary evaluation using an Unmanned Aerial Vehicle (UAV) based surveillance system as a case study.
In today's embedded systems, engineers are trying to get as much performance out of designs while minimizing the energy consumed in order to maximize battery life. Furthermore, embedded systems and their computational sub-systems are becoming more heterogeneous, containing compute resources such as general-purpose processors, graphics processing units, and FPGAs. Because of this heterogeneity, there is a rich area for optimization, especially when considering the mapping of a dynamic, real-time application to these heterogeneous resources. One approach involves maximizing the performance of a task on a given architecture with a given energy constraint. However, this approach will not minimize power and energy consumption. Therefore, in this paper, we propose new dynamic runtime optimizations that can schedule dynamic tasks to a heterogeneous system while minimizing energy consumption and deadlines missed. Through experimentation, we found improvements in energy efficiency of up to 390× relative to a baseline greedy scheduler.
Monte-Carlo (MC) simulation is an effective tool for solving complex problems such as many-body simulation, exotic option pricing and partial differential equation solving. The huge amount of computation in MC makes it a good candidate for acceleration using hardware and distributed computing platforms. We propose a novel MC simulation framework suitable for a wide range of problems. This framework enables different hardware accelerators in a multi-accelerator heterogeneous cluster to work collaboratively on a single application. It also provides scheduling interfaces to adaptively balance the workload according to the cluster status. Two financial applications, involving asset simulation and option pricing, are built using this framework to demonstrate its capability and flexibility. A cluster with 8 Virtex-5 xc5vlx330t FPGAs and 8 Tesla C1060 GPUs using the proposed framework provides 44 times speedup and 19.6 times improved energy efficiency over a cluster with 16 AMD Phenom 9650 quad-core 2.4GHz CPUs for the GARCH asset simulation application. The Efficient Allocation Line (EAL) is proposed for determining the most efficient allocation of accelerators for either performance or energy consumption.
Modern applications require powerful high-performance platforms to deal with many different algorithms that make use of massive calculations. At the same time, low-cost, high-performance specific hardware (e.g., GPU, PPU) is rising and CPUs have turned to multiple cores, together forming an interesting and powerful heterogeneous execution platform. Therefore, self-adaptive computing is a potential paradigm for those scenarios, as it can provide the flexibility to explore the computational resources of a heterogeneous cluster attached to a high-performance computer system platform. As a first step towards a run-time rescheduling load-balancing framework targeting that kind of platform, application time requirements and their crosscutting behavior play an important role in task allocation decisions. This paper presents a strategy for self-reallocation of specific tasks, including dynamically created ones, using aspect-oriented paradigms to address non-functional application timing constraints in the design phase. Additionally, as a case study, special attention is given to radar image processing in the context of a surveillance system based on unmanned aerial vehicles (UAVs).
Accelerating breadth-first search (BFS) can be a compelling value-add given its pervasive deployment. The current state-of-the-art hybrid BFS algorithm selects different traversal directions based on graph properties, thereby possessing heterogeneous characteristics. Related work has studied this heterogeneous BFS algorithm on homogeneous processors. In recent years heterogeneous processors have become mainstream due to their ability to maximize performance under restrictive thermal budgets. However, current software fails to fully leverage the heterogeneous capabilities of the modern processor, lagging behind hardware advancements. We propose a "hybrid++" BFS algorithm for an accelerated processing unit (APU), a heterogeneous processor which fuses the CPU and GPU cores on a single die. Hybrid++ leverages the strengths of CPUs and GPUs for serial and data-parallel execution, respectively, to carefully partition BFS by selecting the appropriate execution core and graph-traversal direction for every search iteration. Our results illustrate that on a variety of graphs, ranging from social networks to road networks, hybrid++ yields a speedup of up to 2× compared to the multithreaded hybrid algorithm. Execution of hybrid++ on the APU is also 2.3× more energy efficient than on a discrete GPU.
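The per-iteration selection in a hybrid++-style BFS might be sketched as follows (the thresholds and the middle-ground policy are hypothetical tuning choices, not the paper's actual values): small, irregular frontiers favor serial-friendly top-down traversal on the CPU, while large frontiers favor data-parallel bottom-up traversal on the GPU.

```python
def choose_step(frontier_size, unvisited, alpha=0.05, beta=0.5):
    """Pick (traversal direction, execution core) for one BFS iteration.
    alpha and beta are illustrative thresholds on the frontier size
    relative to the number of still-unvisited vertices."""
    if frontier_size < alpha * unvisited:
        return ("top-down", "cpu")      # tiny frontier: serial-friendly
    if frontier_size > beta * unvisited:
        return ("bottom-up", "gpu")     # huge frontier: scan unvisited set
    return ("top-down", "gpu")          # mid-size frontier: parallel top-down
```

As the search progresses, the frontier typically grows and then shrinks, so a single traversal migrates between cores and directions across iterations.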
High performance computing is experiencing a major paradigm shift with the introduction of accelerators, such as graphics processing units (GPUs) and Intel Xeon Phi (MIC). These processors have made a tremendous computing power available at low cost, and are transforming machines into hybrid systems equipped with CPUs and accelerators. Although these systems can deliver a very high peak performance, making full use of their resources in real-world applications is a complex problem. Most current applications deployed to these machines are still executed on a single processor, leaving other devices underutilized. In this paper we explore a scenario in which applications are composed of hierarchical dataflow tasks which are allocated to nodes of a distributed memory machine in coarse grain, but each of them may be composed of several finer-grain tasks which can be allocated to different devices within the node. We propose and implement novel performance-aware scheduling techniques that can be used to allocate tasks to devices. We evaluate our techniques using a pathology image analysis application used to investigate brain cancer morphology, and our experimental evaluation shows that the proposed scheduling strategies significantly outperform other efficient scheduling techniques, such as Heterogeneous Earliest Finish Time (HEFT), in cooperative executions using CPUs, GPUs, and MICs. We also experimentally show that our strategies are less sensitive to inaccuracy in the scheduling input data and that the performance gains are maintained as the application scales.
The algorithmic and implementation principles are explored in gainfully exploiting GPU accelerators in conjunction with multicore processors on high-end systems with large numbers of compute nodes, and evaluated in an implementation of a scalable block tridiagonal solver. The accelerator of each compute node is exploited in combination with the multicore processors of that node in performing block-level linear algebra operations in the overall, distributed solver algorithm. Optimizations incorporated include: (1) an efficient memory mapping and synchronization interface to minimize data movement, (2) multi-process sharing of the accelerator within a node to obtain balanced load with multicore processors, and (3) an automatic memory management system to efficiently utilize accelerator memory when sub-matrices spill over the limits of device memory. Results are reported from our novel implementation that uses the MAGMA and CUBLAS accelerator software systems simultaneously with ACML [2] for multithreaded execution on processors. Overall, using 940 NVIDIA Tesla X2090 accelerators and 15,040 cores, the best heterogeneous execution delivers a 10.9-fold reduction in run time relative to an already efficient parallel multicore-only baseline implementation that is highly optimized with intra-node and inter-node concurrency and computation-communication overlap. Detailed quantitative results are presented to explain all critical runtime components contributing to hybrid performance.
Almost all hardware platforms to date have been homogeneous, with one or more identical processors managed by the operating system (OS). However, recently, it has been recognized that power constraints and the need for domain-specific high performance computing may lead architects towards building heterogeneous architectures and platforms in the near future. In this paper, we consider three types of heterogeneous core architectures: (a) virtual asymmetric cores: multiple processors that have identical core micro-architectures and ISA but each running at a different frequency point, or perhaps having a different cache size; (b) physically asymmetric cores: heterogeneous cores, each with a fundamentally different microarchitecture (in-order vs. out-of-order, for instance) running at similar or different frequencies, with identical ISA; and (c) hybrid cores: multiple cores, where some cores have tightly coupled hardware accelerators or special functional units. We show case studies that highlight why existing OS and hardware interaction in such heterogeneous architectures is inefficient and causes loss in application performance, throughput efficiency and quality of service. We then discuss the hardware and software support needed to address these challenges in heterogeneous platforms and establish efficient heterogeneous environments for platforms in the next decade. In particular, we outline a monitoring and prediction framework for heterogeneity, along with software support to take advantage of this information. Based on measurements on real platforms, we show that these proposed techniques can provide significant advantages in terms of performance and power efficiency in heterogeneous platforms.
Recently, hybrid CPU/GPU clusters have been widely used to deal with compute-intensive problems, such as the subset-sum problem. The two-list algorithm is a well-known approach to solving this problem. However, a hybrid MPI-CUDA dual-level parallelization of the algorithm on such a cluster is not straightforward. The key challenge is how to allocate the most suitable workload to each node to achieve good load balancing between nodes and minimize communication overhead. Therefore, this paper proposes an effective workload distribution scheme which aims to reasonably assign workload to each node. Based on this scheme, an efficient MPI-CUDA parallel implementation of the two-list algorithm is presented. A series of experiments is conducted to compare the performance of the hybrid MPI-CUDA implementation with that of the best sequential CPU implementation, the single-node CPU-only implementation, the single-node GPU-only implementation, and the hybrid MPI-OpenMP implementation with the same cluster configuration. The results show that the proposed hybrid MPI-CUDA implementation not only offers significant performance benefits but also has excellent scalability.
Adopting multiple processing units to enhance computing capability or reduce power consumption has been widely accepted in the design of modern computing systems. Such configurations impose challenges on energy efficiency in hardware and software implementations. This work targets power-aware and energy-efficient task partitioning and processing-unit allocation for periodic real-time tasks on a platform with a library of applicable processing-unit types. Each processing-unit type has its own power consumption characteristics for maintaining its activeness and executing jobs. This paper proposes polynomial-time algorithms for energy-aware task partitioning and processing-unit allocation. The proposed algorithms first decide how to assign tasks onto processing-unit types to minimize the energy consumption, and then allocate processing units to fit the demands. For systems without a limit on the number of allocated processing units, the proposed algorithms are shown to have an (m+1)-approximation factor, where m is the number of available processing-unit types. For systems with a limit on the number of allocated processing units, the proposed algorithm is shown to have bounded resource augmentation on the limited number of allocated units. Experimental results show that the proposed algorithms are effective for the minimization of the overall energy consumption. © 2009 IEEE.
The design cycle for complex special-purpose computing systems is extremely costly and time-consuming. It involves a multiparametric design space exploration for optimization, followed by design verification. Designers of special purpose VLSI implementations often need to explore parameters, such as optimal bitwidth and data representation, through time-consuming Monte Carlo simulations. A prominent example of this simulation-based exploration process is the design of decoders for error correcting systems, such as the Low-Density Parity-Check (LDPC) codes adopted by modern communication standards, which involves thousands of Monte Carlo runs for each design point. Currently, high-performance computing offers a wide set of acceleration options that range from multicore CPUs to Graphics Processing Units (GPUs) and Field Programmable Gate Arrays (FPGAs). The exploitation of diverse target architectures is typically associated with developing multiple code versions, often using distinct programming paradigms. In this context, we evaluate the concept of retargeting a single OpenCL program to multiple platforms, thereby significantly reducing design time. A single OpenCL-based parallel kernel is used without modifications or code tuning on multicore CPUs, GPUs, and FPGAs. We use SOpenCL (Silicon to OpenCL), a tool that automatically converts OpenCL kernels to RTL in order to introduce FPGAs as a potential platform to efficiently execute simulations coded in OpenCL. We use LDPC decoding simulations as a case study. Experimental results were obtained by testing a variety of regular and irregular LDPC codes that range from short/medium (e.g., 8,000 bit) to long length (e.g., 64,800 bit) DVB-S2 codes. We observe that, depending on the design parameters to be simulated, on the dimension and phase of the design, the GPU or FPGA may suit different purposes more conveniently, thus providing different acceleration factors over conventional multicore CPUs.
Future mainstream microprocessors will likely integrate specialized accelerators, such as GPUs, onto a single die to achieve better performance and power efficiency. However, it remains a keen challenge to program such a heterogeneous multicore platform, since these specialized accelerators feature ISAs and functionality that are significantly different from the general purpose CPU cores. In this paper, we present EXOCHI: (1) Exoskeleton Sequencer (EXO), an architecture to represent heterogeneous accelerators as ISA-based MIMD architecture resources, and a shared virtual memory heterogeneous multithreaded program execution model that tightly couples specialized accelerator cores with general-purpose CPU cores, and (2) C for Heterogeneous Integration (CHI), an integrated C/C++ programming environment that supports accelerator-specific inline assembly and domain-specific languages. The CHI compiler extends the OpenMP pragma for heterogeneous multithreading programming, and produces a single fat binary with code sections corresponding to different instruction sets. The runtime can judiciously spread parallel computation across the heterogeneous cores to optimize performance and power. We have prototyped the EXO architecture on a physical heterogeneous platform consisting of an Intel Core 2 Duo processor and an 8-core 32-thread Intel Graphics Media Accelerator X3000. In addition, we have implemented the CHI integrated programming environment with the Intel C++ Compiler, runtime toolset, and debugger. On the EXO prototype system, we have enhanced a suite of production-quality media kernels for video and image processing to utilize the accelerator through the CHI programming interface, achieving significant speedup (1.41X to 10.97X) over execution on the IA32 CPU alone. Copyright © 2007 ACM.
Led by the high performance computing potential of modern heterogeneous desktop systems and the predominance of video content in general applications, we propose herein an autonomous unified video encoding framework for hybrid multi-core CPU and multi-GPU platforms. To fully exploit the capabilities of these platforms, the proposed framework integrates simultaneous execution control, automatic data access management, and adaptive scheduling and load balancing strategies to deal with the overall complexity of the video encoding procedure. These strategies treat the collaborative inter-loop encoding as a unified optimization problem to efficiently exploit several levels of concurrency between computation and communication. To support a wide range of CPU and GPU architectures, a specific encoding library is developed with highly optimized algorithms for all inter-loop modules. The obtained experimental results show that the proposed framework achieves real-time encoding of full high-definition sequences on state-of-the-art CPU+GPU systems, outperforming individual GPU and quad-core CPU executions by more than 2 and 5 times, respectively.
Embedded systems have become an integral part of High Performance Computing (HPC) due to their appealing energy and resource consumption characteristics. Required performance goals can often be achieved only by deploying the application on a heterogeneous platform. The established approach of designing a custom-made FPGA architecture platform targeting a particular application is challenged by novel multiprocessor heterogeneous platforms built from existing off-the-shelf embedded CPUs. The challenge is to exploit all the available parallelism and heterogeneity to design a time- and cost-efficient, but also reusable, solution that will meet the performance goals. In this paper the heterogeneity of a parallel SoC system is evaluated by examining the use of different processor cores and memory configurations. The design space exploration undertaken relies on several hypotheses concerning the relations between design concepts. Since the system operating frequency is a crucial, but not the only, parameter of design performance success, the importance of the processor data path is emphasized, as it guides the application mapping approach. We show that, depending on the different heterogeneous element configurations used within a multicore SoC solution, the suitability of particular processor cores varies with the application type, but the achieved performance is comparable to application-specific custom-generated hardware.
Emerging heterogeneous and homogeneous processing architectures demonstrate significant increases in throughput for scientific applications over traditional single-core processors. These processing architectures vary widely in their processing capabilities, memory hierarchies, and programming models. Determining the system architecture best suited to an application, or deploying an application that is portable across a number of different platforms, is increasingly complex and error prone within this rapidly growing and evolving design space. Quickly and easily designing portable, high-performance applications that function properly and maintain their correctness across these widely varied systems has become paramount. To deal with these programming challenges, there is a great need for new models and tools. One example is MIT Lincoln Laboratory's Parallel Vector Tile Optimizing Library (PVTOL), which simplifies the task of developing software in C++ for these complex systems. This work extends the Tasks and Conduits framework in PVTOL to support GPU architectures and other heterogeneous platforms supported by the NVIDIA CUDA and OpenCL programming models. This allows the rapid porting of applications to a very wide range of architectures and clusters. Using this framework, porting applications from a single CPU core to a GPU requires a change of only 5 source lines of code (SLOC) in addition to the CUDA or OpenCL kernel. Using GPU-PVTOL we have achieved a 22x speedup in an application of Monte Carlo simulations of photon propagation through a biological medium, and a 60x speedup of a 3D cone beam computed tomography (CT) image reconstruction algorithm.
Hierarchical levels of heterogeneity exist in many modern high performance clusters, in the form of heterogeneity between computing nodes and within a node with the addition of specialized accelerators, such as GPUs. To achieve high performance of scientific applications on these platforms it is necessary to perform load balancing. In this paper we present a hierarchical matrix partitioning algorithm based on realistic performance models at each level of the hierarchy. To minimise the total execution time of the application, it iteratively partitions a matrix between nodes and partitions these sub-matrices between the devices in a node. This is a self-adaptive algorithm that dynamically builds the performance models at run-time, and it employs an algorithm to minimise the total volume of communication. This algorithm allows scientific applications to perform load-balanced matrix operations with nested parallelism on hierarchical heterogeneous platforms. To show the effectiveness of the algorithm we applied it to a fundamental operation in scientific parallel computing, matrix multiplication. Large-scale experiments on a heterogeneous multi-cluster site incorporating multicore CPU and GPU nodes show that the presented algorithm outperforms current state-of-the-art approaches and successfully load balances very large problems.
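The two-level partitioning idea can be sketched as follows. This is a minimal illustration, not the paper's algorithm: it assumes the performance models have already been reduced to a single relative-speed number per node and per device, and all names and numbers here are hypothetical.

```python
def partition(total, speeds):
    """Split `total` work units proportionally to relative `speeds`,
    giving the rounding remainder to the fastest entries."""
    s = sum(speeds)
    shares = [int(total * v / s) for v in speeds]
    rem = total - sum(shares)
    for i in sorted(range(len(speeds)), key=lambda i: -speeds[i])[:rem]:
        shares[i] += 1
    return shares

def hierarchical_partition(n_rows, node_speeds, device_speeds_per_node):
    """Level 1: rows across nodes; level 2: each node's rows across its devices."""
    node_rows = partition(n_rows, node_speeds)
    return [partition(rows, devs)
            for rows, devs in zip(node_rows, device_speeds_per_node)]

# Hypothetical cluster: 3 nodes, each hosting a CPU and a GPU of differing speed.
plan = hierarchical_partition(
    1000,
    node_speeds=[1.0, 2.0, 1.0],
    device_speeds_per_node=[[1, 3], [1, 7], [2, 2]])
```

In the real algorithm the speed numbers are functional performance models rebuilt at run-time, and the partition shapes are also chosen to minimise communication volume; the proportional split above only captures the load-balancing core.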
Growing demand for energy-efficient, high-performance systems has resulted in the growth of innovative heterogeneous computing system architectures that use FPGAs. FPGA-based architectures enable designers to implement custom instruction streams executing on potentially thousands of compute elements. Traditionally, FPGAs have been used as compute elements on PCI devices; however, this does not allow the FPGAs to act as coprocessors. This paper describes a high-performance system architecture based on the Intel® Xeon® platform in which one or more FPGAs, acting as application accelerators, replace one or more processors in a dual/multi-processor (DP/MP) platform. The FPGA is thus connected directly to the Front Side Bus (FSB) and enjoys the same privileges as a processor, i.e., full participation in the coherency protocol and unrestricted access to system memory and to other processors via the high-bandwidth, low-latency connection to the FSB. In addition, we also describe a software layer called the "Accelerator Abstraction Layer (AAL)", which provides a uniform, hardware- and/or platform-independent application interface. Applications written on AAL can be ported without modification to multiple platforms that have different types of accelerators. The AAL also enables the developer/user to reprogram the FPGA on the fly (analogous to an operating system context switch), thereby exploiting the programmable nature of the FPGA. The resulting hardware/software stack creates a flexible and powerful platform for accelerator innovation and deployment. Copyright 2009 ACM.
Modern computers are equipped with powerful computing engines like multicore processors and GPUs. The 3DEM community has rapidly adapted to this scenario and many software packages now make use of high performance computing techniques to exploit these devices. However, the implementations thus far are purely focused on either GPUs or CPUs. This work presents a hybrid approach that collaboratively combines the GPUs and CPUs available in a computer and applies it to the problem of tomographic reconstruction. Proper orchestration of workload in such a heterogeneous system is an issue. Here we use an on-demand strategy whereby the computing devices request a new piece of work to do when idle. Our hybrid approach thus takes advantage of the whole computing power available in modern computers and further reduces the processing time. This CPU+GPU co-processing can be readily extended to other image processing tasks in 3DEM.
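The on-demand strategy described above (idle devices request a new piece of work) can be sketched with a shared queue from which workers pull their next item. The names (`run_on_demand`, the toy item functions) are illustrative, not from the paper, and Python threads stand in for the CPU and GPU processing back-ends:

```python
import queue
import threading

def run_on_demand(pieces, workers):
    """`pieces` is a list of work items; `workers` maps a worker name to
    the function that processes one item (e.g. a CPU or GPU routine).
    Each worker pulls the next item whenever it becomes idle."""
    q = queue.Queue()
    for p in pieces:
        q.put(p)
    results, lock = {}, threading.Lock()

    def worker(name, fn):
        while True:
            try:
                item = q.get_nowait()   # request new work when idle
            except queue.Empty:
                return                  # no work left: worker finishes
            out = fn(item)
            with lock:
                results[item] = (name, out)

    threads = [threading.Thread(target=worker, args=(n, f))
               for n, f in workers.items()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# Two toy "devices" processing 8 slabs of a reconstruction volume.
res = run_on_demand(range(8), {"cpu": lambda x: x * x,
                               "gpu": lambda x: x * x})
```

Because faster devices return to the queue sooner, they automatically take a larger share of the slabs, which is the essence of the on-demand balancing.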
The use of accelerators such as graphics processing units (GPUs) has become popular in scientific computing applications due to their low cost, impressive floating-point capabilities, high memory bandwidth, and low electrical power requirements. Hybrid high-performance computers, machines with more than one type of floating-point processor, are now becoming more prevalent due to these advantages. In this work, we discuss several important issues in porting a large molecular dynamics code for use on parallel hybrid machines – (1) choosing a hybrid parallel decomposition that works on central processing units (CPUs) with distributed memory and accelerator cores with shared memory, (2) minimizing the amount of code that must be ported for efficient acceleration, (3) utilizing the available processing power from both multi-core CPUs and accelerators, and (4) choosing a programming model for acceleration. We present our solution to each of these issues for short-range force calculation in the molecular dynamics package LAMMPS, however, the methods can be applied in many molecular dynamics codes. Specifically, we describe algorithms for efficient short range force calculation on hybrid high-performance machines. We describe an approach for dynamic load balancing of work between CPU and accelerator cores. We describe the Geryon library that allows a single code to compile with both CUDA and OpenCL for use on a variety of accelerators. Finally, we present results on a parallel test cluster containing 32 Fermi GPUs and 180 CPU cores.
Many-core accelerators are being deployed more frequently to improve system processing capabilities. In such systems, application mapping must be enhanced to maximize utilization of the underlying architecture. Especially in graphics processing units (GPUs), mapping the kernels that are part of multi-kernel applications has a great impact on overall performance, since kernels may exhibit different characteristics on different CPUs and GPUs. While some kernels run faster on GPUs, others may perform better on CPUs. Thus, heterogeneous execution may yield better performance than executing the application only on a CPU or only on a GPU. In this paper, we investigate two approaches: a novel profiling-based adaptive kernel mapping algorithm to assign each kernel of an application to the proper device, and a Mixed-Integer Programming (MIP) implementation to determine the optimal mapping. We utilize profiling information for kernels on different devices and generate a map that identifies which kernel should run where in order to improve the overall performance of an application. Initial experiments show that our approach can efficiently map kernels onto CPUs and GPUs, and outperforms CPU-only and GPU-only approaches.
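As a minimal illustration of the profiling-based idea (not the paper's algorithm, which also has a MIP formulation and would account for inter-kernel effects such as data transfers), a greedy mapper can simply send each kernel to the device with the lowest measured time. Kernel names and timings below are hypothetical:

```python
def map_kernels(profile):
    """profile: {kernel: {device: measured_time}} -> {kernel: device}.
    Greedy per-kernel mapping: each kernel runs on its fastest device."""
    return {k: min(times, key=times.get) for k, times in profile.items()}

# Hypothetical profiling data for a three-kernel application.
mapping = map_kernels({
    "fft":    {"cpu": 9.0, "gpu": 2.0},
    "sort":   {"cpu": 3.0, "gpu": 4.5},
    "reduce": {"cpu": 1.0, "gpu": 1.5},
})
```

The greedy choice is optimal only when kernels are independent; coupling through shared data is what motivates the exact MIP model in the paper.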
Multi-core CPU and GPU technologies upgrade a PC to a personal supercomputer. Many algorithms have been proposed to achieve dynamic scheduling in CPU-GPU hybrid environments. Among them, only pure self-scheduling (PSS) can achieve perfect load balancing in this extremely heterogeneous environment. However, PSS cannot take full advantage of GPU performance, nor can it reduce the overhead of the tail problem or the dynamic scheduling overhead. In this paper, load-prediction scheduling (LPS) is introduced to solve these problems. To demonstrate the efficiency of LPS in practical applications, it was implemented to parallelize the computer simulation of the electrocardiogram (ECG). Experimental results of LPS on the ECG simulation show that the LPS algorithm is more efficient than PSS.
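A rough sketch of the load-prediction idea, under assumed details (the paper's exact chunk-size formula is not reproduced here): each time a device becomes idle it receives a chunk proportional to its predicted share of the remaining work, shrunk so that a tail remains for final balancing. This contrasts with PSS, where every request returns the same small fixed chunk:

```python
def next_chunk(remaining, device_speed, total_speed, min_chunk=1):
    """Half of the device's proportional share of the remaining work,
    so a tail is kept back for later balancing (assumed heuristic)."""
    chunk = int(remaining * device_speed / total_speed / 2)
    return max(min(chunk, remaining), min(min_chunk, remaining))

def simulate(total, speeds):
    """Round-robin service order as a stand-in for 'whichever device
    becomes idle next'; returns work units done per device."""
    done = {i: 0 for i in range(len(speeds))}
    s, remaining, i = sum(speeds), total, 0
    while remaining > 0:
        c = next_chunk(remaining, speeds[i], s)
        done[i] += c
        remaining -= c
        i = (i + 1) % len(speeds)
    return done

done = simulate(100, [1.0, 10.0])  # one slow device, one 10x faster
```

The large early chunks keep the GPU busy (avoiding PSS's per-item scheduling overhead), while the geometrically shrinking remainder limits the tail imbalance.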
Heterogeneous architectures consisting of general-purpose CPUs and throughput-optimized GPUs are projected to be the dominant computing platforms for many classes of applications. The design of such systems is more complex than that of homogeneous architectures because maximizing resource utilization while minimizing shared resource interference between CPU and GPU applications is difficult. We show that GPU applications tend to monopolize the shared hardware resources, such as memory and network, because of their high thread-level parallelism (TLP), and discuss the limitations of existing GPU-based concurrency management techniques when employed in heterogeneous systems. To solve this problem, we propose an integrated concurrency management strategy that modulates the TLP in GPUs to control the performance of both CPU and GPU applications. This mechanism considers both GPU core state and system-wide memory and network congestion information to dynamically decide on the level of GPU concurrency to maximize system performance. We propose and evaluate two schemes: one (CM-CPU) for boosting CPU performance in the presence of GPU interference, the other (CM-BAL) for improving both CPU and GPU performance in a balanced manner and thus overall system performance. Our evaluations show that the first scheme improves average CPU performance by 24%, while reducing average GPU performance by 11%. The second scheme provides 7% average performance improvement for both CPU and GPU applications. We also show that our solution allows the user to control performance trade-offs between CPUs and GPUs.
In this paper we explore mapping of a high-level macro data-flow programming model called Concurrent Collections (CnC) onto heterogeneous platforms in order to achieve high performance and low energy consumption while preserving the ease of use of data-flow programming. Modern computing platforms are becoming increasingly heterogeneous in order to improve energy efficiency. This trend is clearly seen across a diverse spectrum of platforms, from small-scale embedded SoCs to large-scale supercomputers. However, programming these heterogeneous platforms poses a serious challenge for application developers. We have designed a software flow for converting high-level CnC programs to the Habanero-C language. CnC programs have a clear separation between the application description, the implementation of each of the application components, and the abstraction of the hardware platform, making it an excellent programming model for domain experts. Domain experts can later employ the help of a tuning expert (either a compiler or a person) to tune their applications with minimal effort. We also extend the Habanero-C runtime system to support work-stealing across heterogeneous computing devices and introduce task affinity for these heterogeneous components to allow users to fine-tune the runtime scheduling decisions. We demonstrate a working example that maps a pipeline of medical image-processing algorithms onto a prototype heterogeneous platform that includes CPUs, GPUs and FPGAs. For the medical imaging domain, where obtaining fast and accurate results is a critical step in diagnosis and treatment of patients, we show that our model offers up to 17.72X speedup and an estimated usage of 0.52X of the power used by CPUs alone, when using accelerators (GPUs and FPGAs) and CPUs.
Efficient mapping and scheduling of partitioned applications are crucial to improving performance on today's reconfigurable multiprocessor system-on-chip (MPSoC) platforms. Most existing heuristics adopt the directed acyclic task graph as the representation, which, unfortunately, is not able to represent typical embedded applications (e.g., real-time and loop-partitioned ones). In this paper we propose a novel approach, based on Ant Colony Optimization, that explores different alternative designs to determine an efficient hardware-software partitioning, to decide the task allocation, and to establish the execution order of the tasks, dealing with the different design constraints imposed by a reconfigurable heterogeneous MPSoC. Moreover, it can be applied to any parallel C application represented through Hierarchical Task Graphs. We show that our methodology, addressing a realistic target architecture, outperforms existing approaches on a representative set of embedded applications.
The single-core processor, which has dominated for over 30 years, is now obsolete, with recent trends shifting towards parallel systems and demanding a huge shift in programming techniques and practices. Moreover, we are rapidly moving towards an age where almost all programming will target parallel systems. Parallel hardware is rapidly evolving, with large heterogeneous systems, typically comprising a mixture of CPUs and GPUs, becoming the mainstream. Additionally, with this increasing heterogeneity comes increasing complexity: not only does the programmer have to worry about where and how to express the parallelism, they must also express an efficient mapping of the computation to the available system resources. This generally requires in-depth expert knowledge that most application programmers do not have. In this paper we describe a new technique that automatically derives optimal mappings of an application onto a heterogeneous architecture, using a Monte Carlo Tree Search (MCTS) algorithm. Our technique exploits high-level design patterns, targeting a set of well-specified parallel skeletons. We demonstrate on a convolution example that our MCTS technique obtains speedups within 5% of those achieved by a hand-tuned version of the same application.
A recent development of heterogeneous platforms (i.e. those containing different types of computing units such as multicore CPUs, GPUs, and FPGAs) has enabled significant improvements in performance when processing large amounts of data in real time. This possibility, however, is still not fully utilized due to a lack of methods for the optimal configuration of software; the allocation of software components to the different computing-unit types is crucial for getting the maximal utilization of the platform, but for more complex systems it is difficult to find a good-enough or optimal configuration ad hoc. In this paper we present an approach to find a feasible and locally optimal solution for allocating software components to processing units in a heterogeneous platform.
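A locally optimal allocation of this kind can be illustrated with a simple hill-climbing sketch. The cost model below (`makespan` over hypothetical component weights and unit speeds) is an assumption for illustration, not the paper's model:

```python
def local_search(components, units, cost):
    """Hill climbing: start from an arbitrary allocation and keep moving
    single components to other units while the total cost decreases.
    Terminates at a local optimum (no single move improves the cost)."""
    alloc = {c: units[0] for c in components}
    cur = cost(alloc)
    improved = True
    while improved:
        improved = False
        for c in components:
            for u in units:
                if u == alloc[c]:
                    continue
                trial = dict(alloc)
                trial[c] = u
                v = cost(trial)
                if v < cur:
                    alloc, cur, improved = trial, v, True
    return alloc, cur

# Hypothetical pipeline components (work units) and unit speeds.
weights = {"decode": 4, "filter": 8, "encode": 2}
speeds = {"cpu": 1.0, "gpu": 4.0}

def makespan(alloc):
    """Cost = execution time of the most loaded unit."""
    load = {u: 0.0 for u in speeds}
    for c, u in alloc.items():
        load[u] += weights[c] / speeds[u]
    return max(load.values())

best, value = local_search(list(weights), list(speeds), makespan)
```

The result is only locally optimal: as the abstract notes, finding the globally best configuration for complex systems is the hard part, and richer cost models and search strategies are where approaches like the paper's differ.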
In recent years, high performance computing (HPC) resources have grown rapidly and diversely. The next generation of HPC platforms is assembled from resources of various types, such as multi-core CPUs and GPUs. Thus, the development of a parallel program that fully utilizes heterogeneously distributed resources in an HPC environment is a challenge. A parallel program should be portable and able to run efficiently on all types of computing resources with the least effort. We combine the advantages of Global Arrays and OpenCL for such parallel programs. We employ OpenCL to implement parallel applications at the fine-grain level so that they can execute across heterogeneous platforms. At the coarse-grain level, we utilize Global Arrays for efficient data communication between computing resources in terms of virtually shared memory. In addition, we also propose a load balancing technique based on the task pool model for hybrid OpenCL/Global Arrays applications on heterogeneous platforms to improve the performance of the applications.
The efficient utilization of computing resources consisting of multi-core CPUs, GPUs, and FPGAs has become an interesting research problem for achieving high performance on heterogeneous Cloud computing platforms. In particular, FPGA accelerators can provide significant business value in Cloud environments due to their great computing capacity with predictable latency and low power consumption. In this paper, a Software as a Service (SaaS) model is enhanced with Quality of Service (QoS) support, harnessing such a heterogeneous hardware architecture (composed of conventional CPUs plus FPGAs as accelerators). More precisely, the proposal takes users' timing requirements into account to manage virtual resources. Hence, novel heterogeneity-aware resource allocation and scheduling algorithms are presented, which can be used both on demand and in advance. A linear regression model that predicts the cost of the requested service is combined with a simple heuristic algorithm in order to allocate different types of Virtual Machines (VMs). Moreover, the framework provides the service efficiently by using an adapted scheduling algorithm that combines CPU and accelerator resources.
OpenCL is now available on a very large set of processors. This makes the language an attractive layer to address multiple targets with a single code base. How sensitive OpenCL code is in practice to the underlying hardware remains to be better understood. This paper studies how realistic it is to use a unique OpenCL code for a set of hardware co-processors with different underlying micro-architectures. In this work, we target the Intel Xeon Phi, NVIDIA K20C and AMD 7970 GPU. All these accelerators provide at least support for OpenCL 1.1 and belong to the same high-end class of accelerator technology. To assess performance, we use the CAPS OpenACC compiler to generate OpenCL code for a moderately complex mini-application, Hydro. This code uses 22 OpenCL kernels and was tailored to limit data transfers between the host and the accelerator device. To study how stable the performance is, we performed many experiments to determine the best OpenCL code for each hardware platform. This paper shows that, if well chosen, a single version of the code can be executed on multiple platforms without significant performance losses (less than 12%). This study confirms the need for auto-tuning technology to look for performance tradeoffs, but also shows that deploying self-tuning/adaptive code is not always necessary if ultimate performance is not the goal.
OpenCL and OpenACC are generic frameworks for heterogeneous programming using CPU and accelerator devices such as GPUs. They have contrasting features: the former explicitly controls devices through API functions, while the latter generates such procedures guided by directives inserted by the programmer. In this paper, we apply these two frameworks to a general-purpose code set for numerical simulations of lattice QCD, a computational approach to the physics of elementary particles based on the Monte Carlo method. The fermion matrix inversion, which is usually the most time-consuming part of lattice QCD simulations, is offloaded to the accelerator devices. From the viewpoints of constructing reusable components based on object-oriented programming and of tuning the code to achieve high performance, we discuss the feasibility of these frameworks through practical implementations.
Recent computer systems and handheld devices are equipped with high computing capability, such as general-purpose GPUs (GPGPUs) and multi-core CPUs. Utilizing such resources for computation has become a general trend, making their availability an important issue for real-time processing. Discrete cosine transform (DCT) and quantization are two major operations in image compression standards that require complex computations. In this paper, we develop an efficient parallel implementation of the forward DCT and quantization algorithms for JPEG image compression using the Open Computing Language (OpenCL). This OpenCL-based parallel implementation utilizes a multi-core CPU and a GPGPU to perform DCT and quantization computations. We demonstrate the capability of this design via two proposed working scenarios. The proposed approach also applies certain optimization techniques to improve kernel execution time and data movement. We developed an optimal OpenCL kernel for a particular device using device-based optimization factors, such as thread granularity, work-item mapping, workload allocation, and vector-based memory access. We evaluated the performance in a heterogeneous environment, finding that the proposed parallel implementation was able to speed up the execution of the DCT and quantization by factors of 7.97 and 8.65, respectively, for 1024 × 1024 and 2084 × 2048 images in 4:4:4 format. © 2015 Springer-Verlag Berlin Heidelberg
ADAS (Advanced Driver Assistance Systems) algorithms increasingly use heavy image processing operations. To embed this type of algorithm, semiconductor companies offer many heterogeneous architectures. These SoCs (Systems on Chip) are composed of different processing units with different capabilities, often including massively parallel computing units. Due to the complexity of these SoCs, predicting whether a given algorithm can be executed in real time on a given architecture is not trivial. In fact, it is not a simple task for automotive industry actors to choose the most suitable heterogeneous SoC for a given application. Moreover, embedding complex algorithms on these systems remains a difficult task due to heterogeneity: it is not easy to decide how to allocate parts of a given algorithm to the different computing units of a given SoC. In order to help the automotive industry embed algorithms on heterogeneous architectures, we propose a novel approach to predict the performance of image processing algorithms on different types of computing units. Our methodology is able to predict an execution-time interval of varying width, with a degree of confidence, using only a high-level description of the algorithms and a few characteristics of the computing units.
We develop optimized multi-dimensional FFT implementations on CPU-GPU heterogeneous platforms for the case when the input is too large to fit on the GPU global memory, and use the resulting techniques to develop a fast Poisson solver. The solver involves memory bound computations for which the large 3D data may have to be transferred over the PCIe bus several times during the computation. We develop a new strategy to decompose and allocate the computation between the GPU and the CPU such that the 3D data is transferred only once to the device memory, and the executions of the GPU kernels are almost completely overlapped with the PCI data transfer. We were able to achieve significantly better performance than what has been reported in previous related work, including over 145 GFLOPS for the three periodic boundary conditions (single precision version), and over 105 GFLOPS for the two periodic, one Neumann boundary conditions (single precision version). The effective bidirectional PCIe bus bandwidth achieved is 9-10 GB/s, which is close to the best possible on our platform. For all the cases tested, the single 3D data PCIe transfer time, which constitutes a lower bound on what is possible on our platform, takes almost 70% of the total execution time of the Poisson solver. © 2014 Elsevier Inc. All rights reserved.
EULAG (Eulerian/semi-Lagrangian fluid solver) is an established computational model developed for simulating thermo-fluid flows across a wide range of scales and physical scenarios. The dynamic core of EULAG includes the multidimensional positive definite advection transport algorithm (MPDATA) and an elliptic solver. In this work we investigate aspects of an optimal parallel version of the 2D MPDATA algorithm on modern hybrid architectures with GPU accelerators, where computations are distributed across both GPU and CPU components. Using the hybrid OpenMP–OpenCL model of parallel programming opens the way to harness the power of CPU–GPU platforms in a portable way. In order to better utilize features of such computing platforms, comprehensive adaptations of MPDATA computations to hybrid architectures are proposed. These adaptations are based on efficient strategies for memory and computing resource management, which allow us to ease memory and communication bounds and better exploit the theoretical floating-point efficiency of CPU–GPU platforms. The main contributions of the paper are:
• a method for the decomposition of the 2D MPDATA algorithm as a tool to adapt MPDATA computations to hybrid architectures with GPU accelerators, minimizing communication and synchronization between CPU and GPU components at the cost of additional computations;
• a method for the adaptation of 2D MPDATA computations to multicore CPU platforms, based on spatial and temporal blocking techniques;
• a method for the adaptation of the 2D MPDATA algorithm to GPU architectures, based on a hierarchical decomposition strategy across data and computation domains, supported by the developed GPU task scheduler allowing for the flexible management of available resources;
• an approach to the parametric optimization of 2D MPDATA computations on GPUs using autotuning, which provides a portable implementation methodology across a variety of GPUs.
Hybrid platforms tested in this study contain different numbers of CPUs and GPUs – from solutions consisting of a single CPU and a single GPU to the most elaborate configuration containing two CPUs and two GPUs. Processors of different vendors are employed in these systems – both Intel and AMD CPUs, as well as GPUs from NVIDIA and AMD. For all the grid sizes and for all the tested platforms, the hybrid version with computations spread across CPU and GPU components allows us to achieve the highest performance. In particular, for the largest MPDATA grids used in our experiments, the speedups of the hybrid versions over GPU and CPU versions vary from 1.30 to 1.69, and from 1.95 to 2.25, respectively.
In recent years, the performance and capabilities of graphics processing units (GPUs) have improved drastically, mostly due to the demands of the entertainment market, with consumers and companies alike pushing for improvements in visual fidelity that are only achieved with high-performing GPUs. Beyond the entertainment market, there is an ongoing global research effort to use this immense computing power for applications beyond graphics, in the domain of general-purpose computing. Efficiently combining these GPU resources with existing CPU resources is an important and open research task. This paper contributes to that effort, focusing on an analysis of the performance factors of combining both resource types, and introduces a novel job scheduler that manages the two resources. Through experimental performance evaluation, the paper reports the most important factors and design considerations that must be taken into account when designing such a job scheduler.
Mobile platforms have started to employ FPGA-based hardware accelerators to address the ever-increasing demand for computing performance. For many applications, the use of an operating system on the hardware platform proves beneficial for reasons of better resource management and more robust security. This paper evaluates the performance and power implications of introducing Linux OS in a signal processing algorithm on two different architectures (a CPU-based and an FPGA-based hybrid architecture). The results reveal a 22-fold improvement in energy budget between the CPU-based implementation and the FPGA-based hybrid implementation, with negligible performance degradation due to the introduction of Linux OS.
The use of GPU clusters for scientific applications in areas such as physics, chemistry and bioinformatics is becoming more widespread. These clusters frequently have different types of processing devices, such as CPUs and GPUs, which can themselves be heterogeneous. To use these devices efficiently, it is crucial to find the amount of work for each processor that balances the computational load among them. This problem is not only NP-hard in essence, but also tricky due to the variety of architectures of those devices. We present PLB-HeC, a Profile-based Load-Balancing algorithm for Heterogeneous CPU-GPU Clusters that performs an online estimation of performance-curve models for each GPU and CPU processor. Its main difference from existing algorithms is the generation of a non-linear system of equations representing the models and its solution using an interior-point method, improving the accuracy of block distribution among processing units. We implemented the algorithm in the StarPU framework and compared its performance with existing load-balancing algorithms using applications from linear algebra, stock markets and bioinformatics. We show that it reduces application execution times in almost all scenarios when using heterogeneous clusters with two or more machine configurations.
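As a rough illustration of the general idea behind such profile-based balancing (a deliberate simplification: linear cost models and a proportional split, not PLB-HeC's non-linear system solved by an interior-point method; all function names are ours):

```python
# Simplified sketch, NOT PLB-HeC itself: fit a linear cost model per
# device from online timing samples, then hand out blocks so that the
# predicted finish times of all devices coincide.

def fit_rate(samples):
    """Least-squares slope through the origin: seconds per block.
    samples: list of (blocks_processed, seconds_taken) pairs."""
    num = sum(blocks * secs for blocks, secs in samples)
    den = sum(blocks * blocks for blocks, _ in samples)
    return num / den

def split_blocks(total_blocks, rates):
    """Share blocks inversely proportionally to each device's
    per-block cost, so every device finishes at about the same time."""
    speeds = [1.0 / r for r in rates]
    total_speed = sum(speeds)
    shares = [round(total_blocks * s / total_speed) for s in speeds]
    shares[0] += total_blocks - sum(shares)  # absorb rounding drift
    return shares
```

For example, a device profiled at 0.01 s/block paired with one at 0.05 s/block would receive 500 and 100 blocks of a 600-block workload, respectively. A real balancer must also refit the models online as new timings arrive, which this sketch omits.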
Future computer systems will be built under much more stringent power budgets due to the limitations of power delivery and cooling systems, so sophisticated power management techniques are required. Power capping is a technique that limits the power consumption of a system to a predetermined level; it has been studied extensively in homogeneous systems, but few studies have addressed power capping of CPU-GPU heterogeneous systems. In this paper, we propose an efficient power capping technique that coordinates DVFS and task mapping in a single computing node equipped with GPUs. In CPU-GPU heterogeneous systems, device frequency settings have to be considered together with task mapping between the CPUs and the GPUs, because frequency scaling can incur load imbalance between them. To guide the DVFS and task mapping settings so as to avoid power violations and load imbalance, we develop new empirical models of the performance and the maximum power consumption of a CPU-GPU heterogeneous system. The models enable us to choose near-optimal device frequencies and task mappings in advance of the application execution. We evaluate the proposed technique with five data-parallel applications on a machine equipped with a single CPU and a single GPU. The experimental results show that the performance achieved by the proposed power capping technique is comparable to the ideal one.
This paper presents a power-aware scheduling algorithm based on efficient distribution of the computing workload to the resources on heterogeneous CPU-GPU architectures. The scheduler manages the resources of several computing nodes with a view to reducing the peak power. The algorithm can be used in concert with adjustable power state software services in order to further reduce the computing cost during high demand periods. Although our study relies on GPU workloads, the approach can be extended to other heterogeneous computer architectures. The algorithm has been implemented in a real CPU-GPU heterogeneous system. Experiments prove that the approach presented reduces peak power by 10 percent compared to a system without any power-aware policy and by up to 24 percent with respect to the worst case scenario with an execution time increase in the range of 2 percent. This leads to a reduction in the system and service costs.
Designing embedded high-performance systems is challenging due to complex algorithms, real-time operation and conflicting goals (e.g. power vs. performance). Heterogeneous platforms that combine processors and custom hardware accelerators are a promising approach. However, manually designing HW/SW systems is prohibitively expensive due to the immense manual effort. This paper introduces SimSH, a Simulink SW/HW co-design framework that provides an automatic path from an algorithm captured in Simulink to a heterogeneous implementation. Given an allocation and a mapping decision, SimSH automatically synthesizes the Simulink model onto the heterogeneous target, reconstructing the synchronization and communication between processing elements. In the process, SimSH detects an underutilized bus and optimizes communication by packing/unpacking. Synthesizing a heterogeneous implementation from Simulink allows the developer to focus on algorithm design, with rapid validation and testing on a heterogeneous platform. We demonstrate the benefits of synthesis using a Sobel edge detection algorithm targeting a heterogeneous architecture of a Blackfin processor and a Spartan-3E FPGA. The synthesized solution is 2.68x faster (and more energy efficient) than pure SW execution.
Modern surveillance systems, such as those based on unmanned aerial vehicles, require powerful high-performance platforms to deal with many different algorithms that involve massive calculations. At the same time, low-cost, high-performance specialized hardware (e.g., GPUs, PPUs) is on the rise, and CPUs have turned to multiple cores, together characterizing an interesting and powerful heterogeneous execution platform. Reconfigurable computing is therefore a promising paradigm for these scenarios, as it can provide the flexibility to explore the computational resources of a heterogeneous cluster attached to a high-performance computer system. As a first step towards a run-time reconfigurable workload-balancing framework targeting that kind of platform, application timing requirements and their crosscutting behavior play an important role in task allocation decisions. This paper presents a strategy to reallocate specific tasks in a surveillance system composed of a fleet of unmanned aerial vehicles, using aspect-oriented paradigms to address non-functional application timing constraints in the design phase. Aspect support from a framework called DERAF is used to support reconfiguration requirements and provide the resource information needed by the reconfigurable load-balancing strategy. In the case study, special attention is given to radar image processing.
Today's systems, from smartphones to workstations, are becoming increasingly parallel and heterogeneous: processing units not only consist of more and more identical cores; systems also commonly contain a discrete general-purpose GPU alongside the CPU, or even integrate both on a single chip. To benefit from this trend, software should utilize all available resources and adapt to varying configurations, including different CPU and GPU performance or competing processes. This paper investigates parallelization and adaptation strategies applied to the example application of dense stereo vision, which forms a basis for advanced driver assistance systems, robotics and gesture recognition, among others, and represents a broad range of similar computer vision methods. For this problem, task-driven as well as data element- and data flow-driven parallelization approaches are feasible. To achieve real-time performance, we first exploit data-element parallelism individually on each device. On this basis, we develop and implement strategies for cooperation between heterogeneous processing units and for automatic adaptation to the hardware available at run time. Each approach is described with respect to, among other aspects, the propagation of data to processors and its relation to established methods. An experimental evaluation on multiple test systems reveals the advantages and limitations of each strategy.
Multimedia applications, such as 3-D games and video decoders, are typically composed of communicating tasks. Their target embedded computing platforms (e.g., TI OMAP3, IBM Cell) contain multiple heterogeneous processing elements. At application design time, it is often unknown which applications will execute simultaneously; hence, resource assignment decisions need to be made by a run-time manager. Run-time assignment of these communicating tasks onto the communication and computation resources of such a multiprocessor platform is a challenging task. In the presence of fine-grain reconfigurable hardware processing elements, the run-time manager also needs to consider the creation of a so-called configuration hierarchy: instead of executing a dedicated hardware task, the fine-grain reconfigurable hardware fabric hosts a programmable softcore block that, in turn, executes the task functionality. Hence, the next challenge for run-time management is to handle a configuration hierarchy efficiently. This paper details a run-time task assignment heuristic that performs fast and efficient task assignment in a multiprocessor system-on-chip containing fine-grain reconfigurable hardware tiles. In addition, this algorithm is capable of managing a configuration hierarchy. We show that being able to handle a configuration hierarchy significantly improves task assignment performance (i.e., success rate and assignment quality). In several cases, adding a configuration hierarchy improves the success rate of the assignment heuristic by 20%.
Heterogeneity is increasing at all levels of computing, certainly with the rise in general purpose computing with GPUs in everything from phones to supercomputers. More quietly it is increasing with the rise of NUMA systems, hierarchical caching, OS noise, and a myriad of other factors. As heterogeneity becomes a fact of life at every level of computing, efficiently managing heterogeneous compute resources is becoming a critical task. The focus of my dissertation is developing methods and systems to allow software to adapt to the heterogeneous hardware it finds at runtime. The goal is to make the complex functions of heterogeneous computing autonomic, handling load balancing, memory coherence and other performance critical factors in the runtime. The investigation began by studying heterogeneity caused by system topology and resource contention in MPI applications. Since then the focus has shifted to work-sharing across CPU and GPU resources for accelerated OpenMP, and automatically managing the hardware capability imbalances between these resources. Moving forward, I propose to produce a system extending upon both previous approaches to offer work-sharing, topology aware affinity management, as well as novel automated memory transformations to reduce communication and increase memory access efficiency.
Modern computing systems featuring different kinds of processing elements have proven efficient in terms of performance/energy trade-offs. Furthermore, these systems usually have to execute multiple concurrent tasks without any a priori knowledge of expected arrival times, in an unpredictable and very dynamic environment. This scenario has spurred interest in self-adaptive systems that dynamically reorganize the use of system resources to optimize for a given goal. The SAVE project will develop a Heterogeneous System Architecture that decides at runtime to execute tasks on the appropriate kind of resource, based on the current requirements. This paper presents a first implementation of a resource allocation policy that dynamically shares heterogeneous resources between multiple running applications. Resource allocation mechanisms are discussed and evaluated in an experimental campaign, showing how the policy helps attain the goals of users' applications.
Recent years have seen the emergence of heterogeneous system architecture (HSA), which offers massive computational power assembled into a compact design. Computer vision applications with massive inherent parallelism benefit greatly from such heterogeneous processors with on-chip CPU and GPU units: the highly parallel and compute-intensive parts of the application can be mapped to the GPU, while the control flow and high-level tasks may run on the CPU. However, these processors pose a considerable challenge to software development due to their hybrid architecture. Sharing of resources (GPU or CPU) among concurrently running applications leads to variations in processing interval, and prolonged processing intervals lead to low-quality results (frame drops) for computer vision algorithms. In this work, we propose resource awareness and self-organisation within the application layer to adapt to the available resources on the heterogeneous processor. The benefits of the new model are demonstrated using a widely used computer vision algorithm, the Harris corner detector. A resource-aware runtime system and a heterogeneous processor were used for evaluation, and the results indicate a well-constrained processing interval and reduced frame drops. Our evaluations demonstrate up to 20% improvements in processing rate and in the accuracy of the detected corner points for Harris corner detection.
As new heterogeneous systems and hardware accelerators appear, high-performance computers can reach a higher level of computational power. Nevertheless, this does not come for free: the more heterogeneity the system presents, the more complex the programming task becomes in terms of resource management. OmpSs is a task-based programming model and framework focused on the runtime exploitation of parallelism from annotated sequential applications. This paper presents a set of extensions to this framework: we show how the application programmer can expose different specialized versions of tasks (i.e. pieces of specific code targeted and optimized for a particular architecture) and how the system can choose between these versions at runtime to obtain the best achievable performance for the given application. From the results obtained on a multi-GPU system, we show that our proposal gives flexibility to the application's source code and can potentially increase the application's performance.
Heterogeneous computing technologies, such as multi-core CPUs, GPUs and FPGAs can provide significant performance improvements. However, developing applications for these technologies often results in coupling applications to specific devices, typically through the use of proprietary tools. This paper presents SHEPARD, a compile time and run-time framework that decouples application development from the target platform and enables run-time allocation of tasks to heterogeneous computing devices. Through the use of special annotated functions, called managed tasks, SHEPARD approximates a task's performance on available devices, and coupled with the approximation of current device demand, decides which device can satisfy the task with the lowest overall execution time. Experiments using a task parallel application, based on an in-memory database, demonstrate the opportunity for automatic run-time task allocation to achieve speed-up over a static allocation to a single specific device.
Distributing the workload across all available Processing Units (PUs) of a high-performance heterogeneous platform (e.g., PCs composed of CPUs and GPUs) is a challenging task, since the execution cost of a task on distinct PUs is non-deterministic and affected by parameters not known a priori. This paper presents Sm@rtConfig, a context-aware runtime and tuning system based on a compromise between reducing the execution time of engineering applications and the cost of task scheduling on CPU-GPU platforms. Using Model-Driven Engineering and Aspect-Oriented Software Development, a high-level specification and implementation of Sm@rtConfig has been created, aiming to improve modularization and reuse across different applications. As a case study, the simulation subsystem of a CFD application has been developed using the proposed approach. The system's tasks were designed considering only their functional concerns, whereas scheduling and other non-functional concerns are handled by Sm@rtConfig aspects, improving task modularity. Although Sm@rtConfig supports multiple PUs, in this case study the tasks have been scheduled to execute on a platform composed of one CPU and one GPU. Experimental results show an overall performance gain of 21.77% in comparison to the static assignment of all tasks to the GPU only.
Heterogeneous architectures are being used extensively to improve system processing capabilities. Critical functions of each application (kernels) can be mapped to different computing devices (i.e. CPUs, GPGPUs, accelerators) to maximize performance. However, the best performance can only be achieved if kernels are accurately mapped to the right device. Moreover, in some cases those kernels can be split and executed over several devices at the same time to maximize the use of compute resources on heterogeneous parallel architectures. In this paper, we define a static partitioning model based on profiling information from previous executions. This model follows a quantitative approach that computes the optimal match according to user-defined constraints. We test different scenarios to evaluate our model: single-kernel and multi-kernel applications. Experimental results show that our static partitioning model can increase the performance of parallel applications by deploying not only different kernels over different devices but also a single kernel over multiple devices. This avoids idle compute resources on heterogeneous platforms and enhances overall performance.
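A toy version of such a static split (our own illustration, not the authors' quantitative model) sizes contiguous chunks of a kernel's iteration space in proportion to each device's profiled throughput:

```python
# Hypothetical helper: carve [0, n) into per-device ranges whose sizes
# are proportional to profiled throughputs (iterations per second).
def partition_iterations(n, throughputs):
    total = sum(throughputs)
    ranges, start = [], 0
    for i, tp in enumerate(throughputs):
        # The last device takes whatever remains, avoiding gaps.
        end = n if i == len(throughputs) - 1 else start + round(n * tp / total)
        ranges.append((start, end))
        start = end
    return ranges
```

Each device then launches the kernel only over its own range; a real partitioner must also weigh data-transfer costs and user-defined constraints, which this sketch ignores.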
Graphics processing units (GPUs) are increasingly being used for general-purpose parallel computing. They provide significant performance gains over multi-core CPU systems and are an easily accessible alternative to supercomputers. The architecture of general-purpose GPU (GPGPU) systems, however, poses challenges in efficiently transferring data between the host and device(s). Although commodity many-core devices such as NVIDIA GPUs provide more than one way to move data around, it is unclear which method is most effective for a particular application. This presents difficulty in supporting latency-sensitive cyber-physical systems (CPS). In this work we present a new approach to data transfer in a heterogeneous computing system that allows direct communication between GPUs and other I/O devices. In addition to adding this functionality, our system also improves communication between the GPU and host. We analyze the current vendor-provided data communication mechanisms and identify which methods work best for particular tasks with respect to throughput and total time to completion. Our method allows a new class of real-time cyber-physical applications to be implemented on a GPGPU system. The results of the experiments presented here show that GPU tasks can be completed in 34 percent less time than with current methods. Furthermore, effective data throughput is at least as good as the current best performers. This work is part of the concurrent development of Gdev, an open-source project to provide Linux operating system support for many-core device resource management.
In the domain of high performance computing, software deployment on heterogeneous distributed processing units has been in practice for many years. However, new hardware technologies, increased complexity of software and significant increase of requirements demand new methods that can manage these concerns in an efficient way. In this paper we propose a new optimization framework that in a systematic way addresses a general allocation model, the software and deployment architectures, and, based on the user preferences, provides a software deployment solution regardless of number of quality attributes used. Additionally, we present the input models to the allocation process capable of describing a number of software and hardware configurations, and a two-step allocation algorithm capable of harnessing these models.
The increasing computational needs of parallel applications inevitably require portability across parallel architectures, which now include heterogeneous processing resources, such as CPUs and GPUs, and multiple SIMD/SIMT widths. However, the lack of a common parallel programming paradigm that provides predictable, near-optimal performance on each resource leads to the use of low-level frameworks with architecture-specific optimizations, which in turn cause the code base to diverge and make porting difficult. Our experiences with parallel applications and frameworks lead us to the conclusion that achieving performance portability requires a common set of high-level directives and efficient mapping onto each architecture. To demonstrate this concept, we develop Trellis, a prototype programming framework that allows the programmer to maintain only a single generic and structured codebase that executes efficiently on both the CPU and the GPU. Our approach annotates such code with a single set of high-level directives, derived from both OpenMP and OpenACC, that is made compatible with both architectures. Most importantly, motivated by the limitations of the OpenACC compiler in transforming such code into a GPU kernel, we introduce a thread synchronization directive and a set of transformation techniques that allow us to obtain GPU code with the desired parallelization and better performance. While a common high-level programming framework for both CPU and GPU is not yet available, our analysis shows that even obtaining the best-case GPU performance with OpenACC, the state-of-the-art solution, requires modifications to the structure of codes to properly exploit braided parallelism and to cope with conditional statements or serial sections. While this already requires prior knowledge of compiler behavior, optimal performance is still unattainable due to the lack of synchronization. We describe the contributions of Trellis in addressing these problems by showing how it achieves correct parallelization of the original codes for three parallel applications, with performance competitive with OpenMP and CUDA, improved programmability and reduced overall code length.
High computational power of GPUs (Graphics Processing Units) offers a promising accelerator for general-purpose computing. However, the need for dedicated programming environments has made the usage of GPUs rather complicated, and a GPU cannot directly execute binary code of a general-purpose application. This paper proposes a two-phase virtual execution environment (GXBIT) for automatically executing general-purpose binary applications on CPU/GPU architectures. GXBIT incorporates two execution phases. The first phase is responsible for extracting parallel hot spots from the sequential binary code. The second phase is responsible for generating the hybrid executable (both CPU and GPU instructions) for execution. This virtual execution environment works well for any applications that run repeatedly. The performance of generated CUDA (Compute Unified Device Architecture) code from GXBIT on a number of benchmarks is close to 63% of the hand-tuned GPU code. It also achieves much better overall performance than the native platforms.
Many of the heterogeneous resources available to modern computers are designed for different workloads. In order to efficiently use GPU resources, the workload must have a greater degree of parallelism than a workload designed for multicore-CPUs. And conceptually, the Intel Xeon Phi coprocessors are capable of handling workloads somewhere in between the two. This multitude of applicable workloads will likely lead to mixing multicore-CPUs, GPUs, and Intel coprocessors in multi-user environments that must offer adequate computing facilities for a wide range of workloads. In this work, we are using a lightweight runtime environment to manage the resource-specific workload, and to control the dataflow and parallel execution in two-way hybrid systems. The lightweight runtime environment uses task superscalar concepts to enable the developer to write serial code while providing parallel execution. In addition, our task abstractions enable unified algorithmic development across all the heterogeneous resources. We provide performance results for dense linear algebra applications, demonstrating the effectiveness of our approach and full utilization of a wide variety of accelerator hardware.
Hybrid computing systems (incorporating FPGAs, GPUs, etc.) have received considerable attention recently as an approach to significant performance gains in many problem domains. Deploying applications on these systems, however, has proven to be difficult and very labor intensive. In this paper we review the current state of practice for application development on hybrid systems. We also present our vision of the application development languages and tools that we believe would greatly benefit the process of designing, implementing, and deploying applications on hybrid systems.
As technology scales below 32 nm, manufacturers have begun to integrate both CPU and GPU cores in a single chip, i.e., a single-chip heterogeneous processor (SCHP), to improve the throughput of emerging applications. In SCHPs, the CPU and the GPU share the total chip power budget while each satisfying its own power constraints. Consequently, to maximize overall throughput and/or power efficiency, both the power budget and the workload should be judiciously allocated between the CPU and the GPU. In this paper, we first demonstrate that optimal allocation of power budget and workload to the CPU and the GPU can provide 13% higher throughput than optimal allocation of workload alone for a single-program workload scenario. Second, we demonstrate that asymmetric power allocation considering per-program characteristics in a multi-programmed workload scenario can provide 9% higher throughput or 24% higher power efficiency than even power allocation per program, depending on the optimization objective. Lastly, we propose effective runtime algorithms that determine near-optimal or optimal combinations of workload and power budget partitioning for both single- and multi-programmed workload scenarios; the runtime algorithms achieve 96% and 99% of the maximum achievable throughput within 5-8 and 3-5 kernel invocations for the single- and multi-programmed workload cases, respectively.
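The joint power/workload search can be caricatured with a toy model (the square-root power-to-performance curves and the exhaustive grid search below are our assumptions, not the paper's empirical models or runtime algorithm):

```python
# Toy sketch of joint power-budget and workload partitioning on a
# CPU+GPU chip. Performance of each side is an assumed concave
# function of the power it receives; we search power split and
# workload split jointly and keep the best combination.
import itertools

def throughput(cpu_power, gpu_power, cpu_share):
    # Assumed performance models: sqrt-shaped power scaling, with the
    # GPU three times as power-efficient as the CPU on this workload.
    cpu_perf = 2.0 * cpu_power ** 0.5      # work units/s on the CPU
    gpu_perf = 6.0 * gpu_power ** 0.5      # work units/s on the GPU
    gpu_share = 1.0 - cpu_share
    # Both sides run in parallel; the slower side dictates finish time.
    t_cpu = cpu_share / cpu_perf if cpu_share else 0.0
    t_gpu = gpu_share / gpu_perf if gpu_share else 0.0
    return 1.0 / max(t_cpu, t_gpu, 1e-12)

def best_config(total_power, steps=100):
    """Grid-search power and workload splits; returns (score, config)."""
    best = None
    for i, j in itertools.product(range(1, steps), range(steps + 1)):
        p_cpu = total_power * i / steps
        cfg = (p_cpu, total_power - p_cpu, j / steps)
        score = throughput(*cfg)
        if best is None or score > best[0]:
            best = (score, cfg)
    return best
```

By construction, the jointly optimized configuration is at least as good as an even power split with an even workload split, mirroring the paper's observation that power and workload must be allocated together.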

Heterogeneous systems show a lot of promise for extracting high performance by combining the benefits of conventional architectures with specialized accelerators in the form of graphics processors (GPUs) and reconfigurable hardware (FPGAs). Extracting this performance often entails programming in disparate languages and models, making it hard for a programmer to work equally well on all aspects of an application. Further, relatively little attention is paid to co-execution: the problem of orchestrating program execution using multiple distinct computational elements that work seamlessly together. We present Liquid Metal, a comprehensive compiler and runtime system for a new programming language called Lime. Our work enables the use of a single language for programming heterogeneous computing platforms, and the seamless co-execution of the resultant programs on CPUs and accelerators that include GPUs and FPGAs. We have developed a number of Lime applications, and successfully compiled some of these for co-execution on various GPU- and FPGA-enabled architectures. Our experience so far leads us to believe the Liquid Metal approach is promising and can make the computational power of heterogeneous architectures accessible.
"Currently, computers can be composed of different Processing Units (PUs): general-purpose as well as programmable and special-purpose. One of the goals of such heterogeneity is to improve applications' performance. In particular, scientific applications can benefit greatly from this kind of platform. They produce large amounts of data within several types of algorithms, and distinct PUs are an alternative to better execute such tasks. This work presents a new system box, composed of CPU, GPU, and FPGA, to carry out on-site X-ray image evaluations. It was first tested by evaluating the performance of a Linear Integration (LI) algorithm over the PUs. This algorithm is widely used in synchrotron experiments, in which high-speed X-ray cameras produce extremely large amounts of data for post-processing analysis, which includes performing LI. In our experiments, LI execution was around 30x faster on the FPGA compared to the CPU, achieving a performance similar to the GPU. Taking the end-to-end application, i.e., including image transfer into memory, this rate increases to hundreds. Issues in using FPGAs as a co-processor under our dynamic scheduling framework are also discussed. Synthesis times for LI when assigned to the FPGA are still too long for dynamic scheduling, preventing online synthesis of functions not designed a priori.
"
"Heterogeneous multi-core platforms are increasingly prevalent due to their perceived superior performance over homogeneous systems. The best performance, however, can only be achieved if tasks are accurately mapped to the right processors. OpenCL programs can be partitioned to take advantage of all the available processors in a system. However, finding the best partitioning for any heterogeneous system is difficult and depends on the hardware and software implementation.
We propose a portable partitioning scheme for OpenCL programs on heterogeneous CPU-GPU systems. We develop a purely static approach based on predictive modelling and program features. When evaluated over a suite of 47 benchmarks, our model achieves a speedup of 1.57 over a state-of-the-art dynamic run-time approach, a speedup of 3.02 over a purely multi-core approach, and a speedup of 1.55 over using just the GPU."
"This paper describes a heterogeneous computer cluster called Axel. Axel contains a collection of nodes; each node can include multiple types of accelerators such as FPGAs (Field Programmable Gate Arrays) and GPUs (Graphics Processing Units). A Map-Reduce framework for the Axel cluster is presented which exploits spatial and temporal locality through different types of processing elements and communication channels. The Axel system enables the first demonstration of FPGAs, GPUs and CPUs running collaboratively for N-body simulation. Performance improvement from 4.4 times to 22.7 times has been achieved using our approach, which shows that the Axel system can combine the benefits of the specialization of FPGA, the parallelism of GPU, and the scalability of computer clusters."
GPUs have recently evolved into very fast parallel co-processors capable of executing general-purpose computations extremely efficiently. At the same time, the evolution of multi-core CPUs has continued, and today's CPUs have 4-8 cores. These two trends, however, have followed independent paths, in the sense that we are aware of very few works that consider both devices cooperating to solve general computations. In this paper we investigate the coordinated use of CPU and GPU to improve the efficiency of applications even further than using either device independently. We use the Anthill runtime environment, a data-flow oriented framework in which applications are decomposed into a set of event-driven filters, where for each event the runtime system can use either the GPU or the CPU for its processing. For evaluation, we use a histopathology application that uses image analysis techniques to classify tumor images for neuroblastoma prognosis. Our experimental environment includes dual- and octa-core machines augmented with GPUs, and we evaluate our approach's performance for standalone and distributed executions. Our experiments show that a pure GPU optimization of the application achieved a factor of 15 to 49 times improvement over the single-core CPU version, depending on the versions of the CPUs and GPUs. We also show that the execution time can be further reduced by a factor of about 2 by using our runtime system, which effectively choreographs the execution to run cooperatively on both the GPU and a single CPU core. We improve on that by adding more cores, all of which were previously neglected or used ineffectively. In addition, the evaluation in a distributed environment has shown near-linear scalability to multiple hosts.
Hybrid systems with CPUs and GPUs have become the new standard in high performance computing. Workload can be split and distributed to the CPU and GPU to exploit them for data parallelism in hybrid systems. But it is challenging to manually split and distribute the workload between CPU and GPU, since the performance of the GPU is sensitive to the workload it receives. Therefore, current dynamic schedulers balance workload between CPU and GPU periodically and dynamically. The periodical balance operation causes frequent synchronizations between CPU and GPU, which often degrade the overall performance because of their overhead. To solve this problem, we propose a Co-Scheduling Strategy Based on Asymptotic Profiling (CAP). CAP dynamically splits and distributes the workload to the CPU and GPU with only a few synchronizations. It adopts a profiling technique to predict performance and partitions the workload according to that prediction, and it is also optimized for the GPU's performance characteristics. We examine our proof-of-concept system with six benchmarks, and evaluation results show that CAP produces up to a 42.7% performance improvement on average compared with state-of-the-art co-scheduling strategies.
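A toy sketch of the profile-then-split idea behind this kind of co-scheduling (not CAP's actual algorithm; the probe size and timings below are invented): time one small probe chunk per device, then hand out the remaining work in a single split proportional to the measured rates, so only a few CPU-GPU synchronizations are ever needed.

```python
def split_by_probe(total_items, probe_items, cpu_probe_time, gpu_probe_time):
    """Estimate each device's rate from one timed probe chunk, then
    split the remaining work proportionally in a single step."""
    cpu_rate = probe_items / cpu_probe_time
    gpu_rate = probe_items / gpu_probe_time
    rest = total_items - 2 * probe_items      # work left after the probes
    gpu_items = round(rest * gpu_rate / (cpu_rate + gpu_rate))
    return gpu_items, rest - gpu_items
```

For example, if a 1000-item probe takes 0.5 s on the CPU and 0.0625 s on the GPU, the GPU is 8x faster and receives eight ninths of the remaining work.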
"Real-time optical mapping technology is a technique that can be used in cardiac disease study and treatment technology development to obtain accurate and comprehensive electrical activity over the entire heart. It provides a dense spatial electrophysiology: each pixel essentially plays the role of a probe on that location of the heart. However, the high-throughput nature of the computation causes significant challenges in implementing a real-time optical mapping algorithm. This is exacerbated by the high frame rate video required for many medical applications (on the order of 1000 fps). Accelerating optical mapping technologies using multiple CPU cores yields modest improvements, but still only performs at 3.66 frames per second (fps). A highly tuned GPU implementation achieves 578 fps. An FPGA-only implementation is infeasible due to the resource requirements for processing intermediate data arrays generated by the algorithm. We present an FPGA-GPU-CPU architecture that is a real-time implementation of the optical mapping algorithm running at 1024 fps. This represents a 273× speed up over a multi-core CPU implementation."
"Fully utilizing the power of modern heterogeneous systems requires judiciously dividing work across all of the available computational devices. Existing approaches for partitioning work require offline training and generate fixed partitions that fail to respond to fluctuations in device performance that occur at run time. We present a novel dynamic approach to work partitioning that requires no offline training and responds automatically to performance variability to provide consistently good performance. Using six diverse OpenCL™ applications, we demonstrate the effectiveness of our approach in scenarios both with and without run-time performance variability, as well as in more extreme scenarios in which one device is non-functional."
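One training-free way to get this style of behavior is chunked self-scheduling, sketched below as a simple simulation (a generic illustration of the idea, not this paper's scheme; device rates are invented): each device pulls the next chunk as soon as it is free, so faster devices naturally take more work and a failed device takes none.

```python
def simulate_self_scheduling(total_chunks, rates):
    """rates: device -> chunks/second; a rate of 0 models a failed
    device. Greedily assigns each chunk to the next-free live device."""
    done = {d: 0 for d in rates}
    free_at = {d: 0.0 for d in rates}        # time each device is next free
    live = [d for d in rates if rates[d] > 0]
    for _ in range(total_chunks):
        d = min(live, key=lambda dev: free_at[dev])
        free_at[d] += 1.0 / rates[d]         # chunk occupies the device
        done[d] += 1
    return done
```

With a GPU nine times faster than the CPU, the GPU absorbs roughly nine tenths of the chunks without any offline model; if one device's rate drops to zero, the other simply takes everything.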
"Mapping of applications onto Multiprocessor System-on-Chip (MPSoC) can be realized either at design-time or run-time. At any time the number of tasks executing in the MPSoC platform can exceed the available resources, requiring efficient run-time mapping techniques to meet the real-time constraints of the applications. This paper presents two run-time mapping heuristics for mapping the tasks of an application in close proximity so as to minimize the communication overhead. In particular, the communication overhead between two adjacent hardware tasks is eliminated by mapping them onto the same reconfigurable processing node. We show that the proposed approach is capable of alleviating Network-on-Chip (NoC) congestion bottlenecks to optimize the overall performance. Based on our investigations of mapping the tasks of applications at run-time onto an 8×8 NoC-based Heterogeneous MPSoC, our mapping heuristics are capable of reducing the total execution time and average channel load of applications when compared to state-of-the-art run-time mapping heuristics."
"The graphics processing unit (GPU) has made significant strides as an accelerator in parallel computing. However, because the GPU has resided out on PCIe as a discrete device, the performance of GPU applications can be bottlenecked by data transfers between the CPU and GPU over PCIe. Emerging heterogeneous computing architectures that “fuse” the functionality of the CPU and GPU, e.g., AMD Fusion and Intel Knights Ferry, hold the promise of addressing the PCIe bottleneck. In this paper, we empirically characterize and analyze the efficacy of AMD Fusion, an architecture that combines general-purpose x86 cores and programmable accelerator cores on the same silicon die. We characterize its performance via a set of micro-benchmarks (e.g., PCIe data transfer), kernel benchmarks (e.g., reduction), and actual applications (e.g., molecular dynamics). Depending on the benchmark, our results show that Fusion produces a 1.7 to 6.0-fold improvement in the data-transfer time, when compared to a discrete GPU. In turn, this improvement in data-transfer performance can significantly enhance application performance. For example, running a reduction benchmark on AMD Fusion with its mere 80 GPU cores improves performance by 3.5-fold over the discrete AMD Radeon HD 5870 GPU with its 1600 more powerful GPU cores."
"As systems scale up continuously, the problem of power consumption for high performance computing (HPC) systems becomes more severe. A heterogeneous system, integrating two or more kinds of processors, can be better adapted to heterogeneity in applications and provide much higher energy efficiency in theory. Many studies have shown that heterogeneous systems are preferable to homogeneous systems in energy consumption in a multi-programmed computing environment. However, how to exploit the energy efficiency (Flops/Watt) of a heterogeneous system for a single application, or even for a single phase in an application, has not been well studied. This paper proposes a power-efficient work distribution method for a single application on a CPU-GPU heterogeneous system. The proposed method coordinates inter-processor work distribution and per-processor frequency scaling to minimize energy consumption under a given scheduling length constraint. We conduct our experiment on a real system equipped with a multi-core CPU and a multi-threaded GPU. Experimental results show that, by reasonably distributing work over the CPU and GPU, the method achieves a 14% reduction in energy consumption compared with static mappings for several typical benchmarks. We also demonstrate that our method can adapt to changes in the scheduling length constraint and hardware configurations."
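The kind of coordinated split-plus-DVFS search this abstract describes can be sketched as a small enumeration (the speed/power tables and the balanced-split assumption are invented for illustration, not the paper's method): try each pair of CPU/GPU frequency levels, split the work so both devices finish together, discard configurations that miss the deadline, and keep the cheapest in energy.

```python
def min_energy_config(work_items, deadline, cpu_modes, gpu_modes):
    """cpu_modes/gpu_modes: lists of (speed items/s, power W), one per
    DVFS level. Returns (energy, time, cpu_mode, gpu_mode) or None."""
    best = None
    for cs, cp in cpu_modes:
        for gs, gp in gpu_modes:
            t = work_items / (cs + gs)    # both finish together
            if t > deadline:
                continue                  # misses the scheduling length
            energy = (cp + gp) * t
            if best is None or energy < best[0]:
                best = (energy, t, (cs, cp), (gs, gp))
    return best
```

Note how the cheapest feasible point is not necessarily the fastest one: running both devices at top frequency can cost more energy than a slower pair that just meets the deadline.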
"Heterogeneous architectures are currently widespread. With the advent of easy-to-program general purpose GPUs, virtually every recent desktop computer is a heterogeneous system. Combining the CPU and the GPU brings great amounts of processing power. However, such architectures are often used in a restricted way for domain-specific applications like scientific applications and games, and they tend to be used by a single application at a time. We envision future heterogeneous computing systems where all their heterogeneous resources are continuously utilized by different applications with versioned critical parts to be able to better adapt their behavior and improve execution time, power consumption, response time and other constraints at runtime. Under such a model, adaptive scheduling becomes a critical component.
In this paper, we propose a novel predictive user-level scheduler based on past performance history for heterogeneous systems. We developed several scheduling policies and present a study of their impact on system performance. We demonstrate that such a scheduler allows multiple applications to fully utilize all available processing resources in CPU/GPU-like systems and consistently achieve speedups ranging from 30% to 40% compared to just using the GPU in a single application mode."
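A minimal sketch of a history-based predictive device chooser in this spirit (the class, its explore-then-exploit policy, and all names are illustrative, not this paper's scheduler): keep past execution times per (kernel, device) pair and dispatch new invocations to the device predicted fastest.

```python
from collections import defaultdict

class HistoryScheduler:
    def __init__(self, devices=("cpu", "gpu")):
        self.devices = devices
        self.history = defaultdict(list)   # (kernel, device) -> run times

    def pick(self, kernel):
        # explore: try each device at least once to seed the history
        for d in self.devices:
            if not self.history[(kernel, d)]:
                return d
        # exploit: choose the device with the best mean past time
        return min(self.devices,
                   key=lambda d: sum(self.history[(kernel, d)]) /
                                 len(self.history[(kernel, d)]))

    def record(self, kernel, device, elapsed):
        self.history[(kernel, device)].append(elapsed)
```

Because the history is per kernel, a kernel that favors the CPU and one that favors the GPU can be scheduled differently, which is what lets multiple applications keep all devices busy.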
Graphics processing units (GPU) have taken an important role in the general purpose computing market in recent years. At present, the common approach to programming GPU units is to write GPU-specific code with low-level GPU APIs such as CUDA. Although this approach can achieve good performance, it creates serious portability issues, as programmers are required to write a specific version of the code for each potential target architecture. This results in high development and maintenance costs. We believe it is desirable to have a programming model which provides source code portability between CPUs and GPUs, as well as different GPUs. This would allow programmers to write one version of the code, which can be compiled and executed on either CPUs or GPUs efficiently without modification. In this paper, we propose MapCG, a MapReduce framework to provide source code level portability between CPUs and GPUs. In contrast to other approaches such as OpenCL, our framework, based on MapReduce, provides a high-level programming model and makes programming much easier. We describe the design of MapCG, including the MapReduce-style high-level programming framework and the runtime system on the CPU and GPU. A prototype of the MapCG runtime, supporting multi-core CPUs and NVIDIA GPUs, was implemented. Our experimental results show that this implementation can execute the same source code efficiently on multi-core CPU platforms and GPUs, achieving an average speedup of 1.6 ~ 2.5x over previous implementations of MapReduce on eight commonly used applications.
"Heterogeneous multiprocessors are increasingly important in the multi-core era due to their potential for high performance and energy efficiency. In order for software to fully realize this potential, the step that maps computations to processing elements must be as automated as possible. However, the state-of-the-art approach is to rely on the programmer to specify this mapping manually and statically. This approach is not only labor intensive but also not adaptable to changes in runtime environments like problem sizes and hardware configurations. In this study, we propose adaptive mapping, a fully automatic technique to map computations to processing elements on heterogeneous multiprocessors. We have implemented it in our experimental heterogeneous programming system called Qilin. Our results demonstrate that, for a set of important computation kernels, automatic adaptive mapping achieves a speedup of 9.3x on average over the best serial implementation by judiciously distributing work over the CPU and GPU, which is 69% and 33% faster than using the CPU or GPU alone, respectively. In addition, adaptive mapping is within 94% of the speedup of the best manual mapping found via exhaustive searching. To the best of our knowledge, Qilin is the first and only system to date that has such capability."
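Adaptive mapping of this kind fits per-device performance models from training runs and then solves for the split analytically. A hedged sketch of that idea with simple linear time-vs-size models (the fitting code, the linear-model assumption, and the split formula are illustrative, not Qilin's exact implementation):

```python
def fit_linear(samples):
    """Least-squares fit of t = a*n + b through (n, t) training pairs."""
    k = len(samples)
    sn = sum(n for n, _ in samples); st = sum(t for _, t in samples)
    snn = sum(n * n for n, _ in samples)
    snt = sum(n * t for n, t in samples)
    a = (k * snt - sn * st) / (k * snn - sn * sn)
    b = (st - a * sn) / k
    return a, b

def gpu_fraction(cpu_samples, gpu_samples, n):
    """Pick the GPU work fraction beta that equalizes predicted finish
    times: a_g*(beta*n) + b_g = a_c*((1-beta)*n) + b_c."""
    ac, bc = fit_linear(cpu_samples)
    ag, bg = fit_linear(gpu_samples)
    beta = (ac * n + bc - bg) / ((ac + ag) * n)
    return min(1.0, max(0.0, beta))
```

Because the models are refit from observed runs, the split adapts automatically to new problem sizes or hardware, which is the property the abstract contrasts with manual static mapping.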
In this paper, we propose SnuCL, an OpenCL framework for heterogeneous CPU/GPU clusters. We show that the original OpenCL semantics naturally fits the heterogeneous cluster programming environment, and that the framework achieves high performance and ease of programming. The target cluster architecture consists of a designated, single host node and many compute nodes, connected by an interconnection network such as Gigabit Ethernet or InfiniBand switches. Each compute node is equipped with multicore CPUs and multiple GPUs. A set of CPU cores or each GPU becomes an OpenCL compute device. The host node executes the host program in an OpenCL application. SnuCL presents a system image of a single operating system instance for the heterogeneous CPU/GPU cluster to the user. It allows the application to utilize compute devices in a compute node as if they were in the host node. No communication API, such as the MPI library, is required in the application source. SnuCL also provides collective communication extensions to OpenCL to facilitate manipulating memory objects. With SnuCL, an OpenCL application becomes portable not only between heterogeneous devices in a single node, but also between compute devices in the cluster environment. We implement SnuCL and evaluate its performance using eleven OpenCL benchmark applications.
"In the field of HPC, the current hardware trend is to design multiprocessor architectures featuring heterogeneous technologies such as specialized coprocessors (e.g., Cell/BE) or data-parallel accelerators (e.g., GPUs). Approaching the theoretical performance of these architectures is a complex issue. Indeed, substantial efforts have already been devoted to efficiently offload parts of the computations. However, designing an execution model that unifies all computing units and associated embedded memory remains a main challenge. We therefore designed StarPU, an original runtime system providing a high-level, unified execution model tightly coupled with an expressive data management library. The main goal of StarPU is to provide numerical kernel designers with a convenient way to generate parallel tasks over heterogeneous hardware on the one hand, and easily develop and tune powerful scheduling algorithms on the other hand. We have developed several strategies that can be selected seamlessly at run-time, and we have analyzed their efficiency on several algorithms running simultaneously over multiple cores and a GPU. In addition to substantial improvements regarding execution times, we have obtained consistent superlinear parallelism by actually exploiting the heterogeneous nature of the machine. We eventually show that our dynamic approach competes with the highly-optimized MAGMA library and overcomes the limitations of the corresponding static scheduling in a portable way."

A Scheduling and Runtime Framework for a Cluster of Heterogeneous Machines with Multiple Accelerators
An integer linear programming model for mapping applications on hybrid systems
Automatic task mapping and heterogeneity-aware fault tolerance: The benefits for runtime optimization and application development
Compositional Performance Analysis of Component-Based Systems on Heterogeneous Multiprocessor Platforms
Consolidating Applications for Energy Efficiency in Heterogeneous Computing Systems
Design and initial performance of a high-level unstructured mesh framework on heterogeneous parallel systems
Energy-efficient allocation of real-time applications onto Heterogeneous Processors
Resource-awareness on heterogeneous MPSoCs for image processing
Scheduling multi-paradigm and multi-grain parallel components on heterogeneous platforms
Scheduling tradeoffs for heterogeneous computing on an advanced space processing platform
A design flow for partially reconfigurable heterogeneous multi-processor platforms
A domain-specific language to facilitate software defined radio parallel executable patterns deployment on heterogeneous architectures
A fast algorithm for constructing inverted files on heterogeneous platforms
A Federated Simulation Environment for Hybrid Systems
A methodology for the design and deployment of reliable systems on heterogeneous platforms
A new era in scientific computing: Domain decomposition methods in hybrid CPU–GPU architectures
A Simulation Framework for Efficient Resource Management on Hybrid Systems
A Stream Processing Framework for On-Line Optimization of Performance and Energy Efficiency on Heterogeneous Systems
Accelerated event-by-event Monte Carlo microdosimetric calculations of electrons and protons tracks on a multi-core CPU and a CUDA-enabled GPU
Accelerating DynEarthSol3D on tightly coupled CPU–GPU heterogeneous processors
Accelerating Spectral Calculation through Hybrid GPU-Based Computing
AKI: Automatic Kernel Identification and Annotation Tool Based on C++ Attributes
An execution time and energy model for an energy-aware execution of a conjugate gradient method with CPU/GPU collaboration
An extended model for multi-criteria software component allocation on a heterogeneous embedded platform
Architecture Aware Resource Allocation for Structured Grid Applications: Flood Modelling Case
ASTRO: Synthesizing application-specific reconfigurable hardware traces to exploit memory-level parallelism
Auto-tuning techniques for linear algebra routines on hybrid platforms
Automatic OpenCL Code Generation for Multi-device Heterogeneous Architectures
Automatic synthesis of embedded SW for evaluating physical implementation alternatives from UML/MARTE models supporting memory space separation
Balancing CPU-GPU Collaborative High-Order CFD Simulations on the Tianhe-1A Supercomputer
Biological sequence comparison on hybrid platforms with dynamic workload adjustment
Bridging the gap between performance and bounds of cholesky factorization on heterogeneous platforms
Combining multi-core and GPU computing for solving combinatorial optimization problems
Compiling for power with ScalaPipe
Component allocation optimization for heterogeneous CPU-GPU embedded systems
Cost efficient synthesis of real-time systems upon heterogeneous multiprocessor platforms
CPU+GPU Programming of Stencil Computations for Resource-Efficient Use of GPU Clusters
Cross Resource Optimisation of Database Functionality across Heterogeneous Processors
Data-aware scheduling of legacy kernels on heterogeneous platforms with distributed memory
dOpenCL: Towards uniform programming of distributed heterogeneous multi-/many-core systems
Dynamic load balancing for real-time video encoding on heterogeneous CPU+GPU systems
Dynamic Load Balancing of the Adaptive Fast Multipole Method in Heterogeneous Systems
Dynamic Load Scheduling on CPU-GPU for Iterative Tomographic Reconstruction
Dynamic Reconfiguration of Tasks Applied to an UAV System Using Aspect Orientation
Dynamic runtime optimizations for systems of heterogeneous architectures
Dynamic scheduling Monte-Carlo framework for multi-accelerator heterogeneous clusters
Dynamic Self-Rescheduling of Tasks over a Heterogeneous Platform
Efficient breadth-first search on a heterogeneous processor
Efficient Execution of Microscopy Image Analysis on CPU, GPU, and MIC Equipped Cluster Systems
Efficient heterogeneous execution on large multicore and accelerator platforms: Case study using a block tridiagonal solver
Efficient interaction between OS and architecture in heterogeneous platforms
Efficient Parallelization of a Two-List Algorithm for the Subset-Sum Problem on a Hybrid CPU/GPU Cluster
Energy minimization for periodic real-Time tasks on heterogeneous processing units
Enhancing Design Space Exploration by Extending CPU/GPU Specifications onto FPGAs
EXOCHI: Architecture and programming environment for a heterogeneous multi-core multithreaded system
FEVES: Framework for Efficient Parallel Video Encoding on Heterogeneous Systems
Heterogeneity impact on MPSoC platforms performance
Heterogeneous tasks and conduits framework for rapid application portability and deployment
Hierarchical partitioning algorithm for scientific computing on highly heterogeneous CPU + GPU clusters
High-performance, energy-efficient platforms in-socket FPGA accelerators
Hybrid computing: CPU+GPU co-processing and its application to tomographic reconstruction
Implementing molecular dynamics on hybrid high performance computers – short range forces
Improving application behavior on heterogeneous manycore systems through kernel mapping
Load-Prediction Scheduling for Computer Simulation of Electrocardiogram on a CPU-GPU PC
Managing GPU Concurrency in Heterogeneous Architectures
Mapping a data-flow programming model onto heterogeneous platforms
Mapping and scheduling of parallel C applications with Ant Colony Optimization onto heterogeneous reconfigurable MPSoCs
Mapping parallel programs to heterogeneous CPU/GPU architectures using a Monte Carlo Tree Search
Multi-criteria software component allocation on a heterogeneous platform
On Load Balancing of Hybrid OpenCL/Global Arrays Applications on Heterogeneous Platforms
On the Provision of SaaS-Level Quality of Service within Heterogeneous Private Clouds
One OpenCL to rule them all?
OpenCL vs OpenACC: Lessons from Development of Lattice QCD Simulation Code
OpenCL-based optimization methods for utilizing forward DCT and quantization of image compression on a heterogeneous platform
Optimal Performance Prediction of ADAS Algorithms on Embedded Parallel Architectures
Optimized FFT computations on heterogeneous platforms with application to the Poisson equation
Parallelization of 2D MPDATA EULAG algorithm on hybrid architectures with GPU accelerators
Performance study of interference on GPU and CPU resources with multiple applications
Performance-Power Design Space Exploration in a Hybrid Computing Platform Suitable for Mobile Applications
PLB-HeC: A Profile-Based Load-Balancing Algorithm for Heterogeneous CPU-GPU Clusters
Power capping of CPU-GPU heterogeneous systems through coordinating DVFS and task mapping
Power-Aware Job Scheduling on Heterogeneous Multicore Architectures
Rapid heterogeneous prototyping from Simulink
Real-time task reconfiguration support applied to an UAV-based surveillance system
Run-time adaptation to heterogeneous processing units for real-time stereo vision
Run-Time Management of a MPSoC Containing FPGA Fabric Tiles
Runtime Adaptation for Autonomic Heterogeneous Computing
Runtime Resource Management in Heterogeneous System Architectures: The SAVE Approach
Self-adaptive harris corner detector on heterogeneous many-core processor
Self-Adaptive OmpSs Tasks in Heterogeneous Environments
SHEPARD: scheduling on heterogeneous platforms using application resource demands
Sm@rtConfig: A context-aware runtime and tuning system using an aspect-oriented approach for data intensive engineering applications
Static partitioning and mapping of kernel-based applications over modern heterogeneous architectures
Supporting Low-Latency CPS Using GPUs and Direct I/O Schemes
Towards a Common Software-to-Hardware Allocation Framework for the Heterogeneous High Performance Computing
Trellis: Portability across architectures with a high-level framework
Two-phase execution of binary applications on CPU/GPU machines
Unified Development for Mixed Multi-GPU and Multi-coprocessor Environments Using a Lightweight Runtime Environment
Visions for application development on hybrid computing systems
Workload-Aware Optimal Power Allocation on Single-Chip Heterogeneous Processors
A Compiler and Runtime for Heterogeneous Computing
A CPU, GPU, FPGA System for X-Ray Image Processing Using High-Speed Scientific Cameras
A Static Task Partitioning Approach for Heterogeneous Systems Using OpenCL
Axel: A Heterogeneous Cluster with FPGAs and GPUs
Coordinating the use of GPU and CPU for improving performance of compute intensive applications
CPU + GPU scheduling with asymptotic profiling
FPGA-GPU-CPU Heterogenous Architecture for Real-time Cardiac Physiological Optical Mapping
Load Balancing in a Changing World: Dealing with Heterogeneity and Performance Variability
Mapping Algorithms for NoC-based Heterogeneous MPSoC Platforms
On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing
Power-efficient Work Distribution Method for CPU-GPU Heterogeneous System
Predictive Runtime Code Scheduling for Heterogeneous Architectures
Providing Source Code Level Portability Between CPU and GPU with MapCG
Qilin: Exploiting Parallelism on Heterogeneous Multiprocessors with Adaptive Mapping
SnuCL: an OpenCL framework for heterogeneous CPU/GPU clusters
StarPU: a unified platform for task scheduling on heterogeneous multicore architectures

graphics processing units;parallel processing;resource allocation;scheduling;CPU+GPU cluster programming;accelerator;data transfer;graphics processing unit;heterogeneous machine;high-performance computing;load balancing;runtime framework;scheduling framework;task allocation;transaction style bulk-synchronous semantics;Kernel;Message systems;Programming;Runtime;Subscriptions;Heterogeneous Architectures;High Performance Computing;Hybrid CPU-GPU Clusters;Multi Scheduling;Work Stealing
application specific integrated circuits;field programmable gate arrays;integer programming;linear programming;microprocessor chips;resource allocation;scheduling;ASIC;FPGA;M-JPEG encoder;hybrid systems;integer linear programming model;memory requirements;microprocessors;resource sharing;scheduling problem
Heterogeneous computing;Accelerators;Fault-tolerant systems;Runtime systems
formal specification;message passing;multiprocessing systems;object-oriented programming;software performance evaluation;MPEG-4 decoder;abstract specification;asynchronous message passing interaction;car navigation system;component-based systems;composable software;compositional performance analysis;executable system model;hardware component model;multiprocessor architecture;operational semantics;simulation-based analysis;Analytical models;Computer architecture;Delay;Hardware;MPEG 4 Standard;Performance analysis;Predictive models;Software performance;Throughput
distributed processing;energy conservation;resource allocation;scheduling;ASJF policy;EEA scheduling policy;RR policy;RUA policy;adaptive shortest-job first policy;energy consumption;energy efficiency;energy efficiency-aware scheduling policy;energy-delay product;heterogeneous computing systems;resource requirements;resource utilization;resource utilization-aware policy;round-robin policy;Algorithm design and analysis;Benchmark testing;Graphics processing units;Kernel;Power demand;Schedules;CPU;GPU;benchmarks;consolidation;heterogeneous computing
OP2;Domain specific language;Active library;Unstructured mesh;GPU;Heterogeneous systems
convex programming;energy consumption;microprocessor chips;multiprocessing systems;power aware computing;convex optimization problem;dynamic power management;dynamic voltage;energy-efficient allocation;energy-efficient hardware design;frequency scaling;heterogeneous multicore processors;load distribution strategy;real-time software component;self-powered vehicle;task assignment heuristic;Energy efficiency
Resource awareness;Invasive Computing;Image processing;Computer vision;MPSoC;Heterogeneous processor
CCA;multi-grain;multi-paradigm;parallel component;performance prediction;resource management
aerospace computing;distributed processing;resource allocation;scheduling;NASA New Millennium Program;advanced space processing platform;embedded space systems;gang scheduling;heterogeneous computing;heterogeneous processing system;networked processors;opportunistic load-balancing;real-time systems;spaceborne platforms;spacecraft systems;Delta modulation;Embedded computing;Field programmable gate arrays;High performance computing;NASA;Power generation;Processor scheduling;Real time systems;Space missions;Space vehicles;Embedded Space;Preemption;Systems
logic design;multiprocessing systems;reconfigurable architectures;resource allocation;system-on-chip;application throughput requirement;configuration-optimization;design flow;energy consumption;fully reconfigurable design;homogeneous design;multiprocessor systems-on-chip;partially reconfigurable heterogeneous multiprocessor platforms;real-life applications;synthetic benchmarks;use-cases;Application specific integrated circuits;Delay;Pareto optimization;Switches;Throughput;Partially reconfigurable systems;design-flow;heterogeneous systems;multiple use-cases
FIR filters;Hilbert transforms;convolution;fast Fourier transforms;parallel programming;program compilers;programming languages;signal processing;software radio;spectral analysis;telecommunication computing;DSL compiler framework;DSP algorithms;DSP routines;DSP-FPGA architectures;Delite;FIR filter;OptiSDR;SDR HCA;cross-correlation;digital signal processing;domain-specific language;heterogeneous computing architectures;high-level source code;hybrid GPU-CPU;parallel executable patterns deployment;programming language;programming paradigms;software defined radio;Computer architecture;DSL;Finite impulse response filters;Graphics processing units;Programming
GPU;Indexer;Inverted files;Multicore;Pipelined and parallel parsing and indexing
hybrid simulation;programming languages;Auto-Pipe;VERITAS gamma-ray astronomy project;X-sim;data flow coordination language X;federated simulation environment;hybrid computing systems;Astronomy;Computational modeling;Computer science;Computer simulation;Costs;Data communication;Field programmable gate arrays;Hardware design languages;High performance computing;Signal processing
embedded systems;field programmable gate arrays;performance evaluation
Hybrid computing;Multi-core processing;Many-core processing;Graphics processing units;Domain decomposition methods;FETI method
graphics processing units;multiprocessing systems;resource allocation;scheduling;GPU;H-system hardware configuration;H-system software configuration;OpenCL language;graphical processing units;hybrid system;manycore architecture;multicore architecture;resource management;simulation framework;Hardware;Instruction sets;Kernel;Scheduling algorithms;CPU;Heterogeneous Computing;Hybrid systems;OpenCL;Simulation
parallel programming;power aware computing;processor scheduling;public domain software;software libraries;C++ library;GEMM;GPU;compute-intensive programs;data element;data flow;data partitioning;demand-based allocation;general matrix multiplication;heterogeneous systems;high-performance computing applications;indeterministic memory access;libHawaii;low-latency maintenance;memory transfers;online energy efficiency optimization;online performance optimization;open source library;overlapping computations;pipelining;processing units;processor performance;processor workloads;sparse image feature detection;sparse image feature extraction;sparse image feature matching;stream processing framework;synchronization;task parallelism;varying processor inputs;Graphics processing units;Libraries;Pipeline processing;Throughput;energy efficiency;heterogeneous computing;load balancing;real-time;stream processing
Radiation therapy;Monte Carlo;GPU;CUDA;Proton tracks;Microdosimetry
Computational Tectonic Modeling;Long-term lithospheric deformation;Heterogeneous computing;GPGPU;Parallel computing
astronomical techniques;astronomy computing;graphics processing units;parallel processing;resource allocation;APEC;Astrophysical Plasma Emission Code;CPU architecture;GPU architecture;GPU-optimized approach;astrophysics;hybrid GPU-based computing;load balance strategy;one-dimensional numerical integrations;process-level parallelism;spectral analysis;spectral calculation;spectral toolset;three-dimensional parameter space;Acceleration;Computational modeling;Computer architecture;Load management;Load modeling;GPU;hybrid architecture;load balancing;numerical integration
C++ language;parallel architectures;source code (software)
Conjugate gradient method;Energy awareness;Energy model;Execution time model;RAPL;GPU
decision making;genetic algorithms;software engineering
application program interfaces;graphics processing units;grid computing;message passing;parallel processing;resource allocation;CPU cores;GPU;MPI parallelization architecture;architecture aware resource allocation;distributed memory;flood modelling case;heterogeneous compute nodes;heterogeneous processing resources;high performance computing;load balance;local search algorithms;scientific modelling software;structured grid applications;structured mesh;Computational modeling;Computer architecture;Load modeling;Partitioning algorithms;Predictive models;CUDA;Domain Decomposition;Flood Modelling;Local Search Algorithm;MPI;OpenMP;Overland Flow Models
Memory-level parallelism;FPGA;Application-specific;Reconfigurable;Memory structure;HLS
linear algebra;matrix decomposition;matrix multiplication;multiprocessing programs
graphics processing units;parallel processing;program compilers;resource allocation;specification languages;GPU;STEPOCL;Xeon Phis;automatic OpenCL code generation;data parallel applications;domain specific language;dynamic load balancing;interaccelerator data movements;multidevice heterogeneous architectures;programming tool;Arrays;Hardware;Kernel;Performance evaluation;Programming;Synchronization;Writing;Accelerators;Code generation;Heterogeneous architectures;OpenCL
Design space exploration;Heterogeneity;MARTE;Software synthesis;System on chip;UML
computational fluid dynamics;flow simulation;graphics processing units;parallel machines;resource allocation;CPU-GPU collaborative high-order CFD simulations;CPU-GPU collaborative high-order accurate aerodynamic simulation;GPU accelerated supercomputers;HOSTA;Tianhe-1A supercomputer;complex grid geometry;in-house high-order CFD software;large scale high-order CFD simulations;load balancing;massive HPC resources;maximum simulation problem size per Tianhe-1A node;naive GPU-only approach;simulated China large civil airplane configuration C919;store-poor GPU;store-rich CPU;Collaboration;Computational modeling;Kernel;Memory management;Performance evaluation;CFD;CPU-GPU collaboration;GPU parallelization;high-order finite difference scheme
bioinformatics;GPUs;multicores;smith-waterman
graphics processing units;matrix decomposition;matrix multiplication;performance evaluation;processor scheduling;resource allocation
Multi-core computing;GPU accelerators;Parallel branch-and-bound;Flowshop scheduling problem
FPGA;Embedded systems;Energy;High-level synthesis
embedded systems;formal specification;graphics processing units;integer programming;multiprocessing systems;performance evaluation;resource allocation;software architecture
computational complexity;costing;embedded systems;multiprocessing systems;processor scheduling;resource allocation;NP-hard problem;approximation algorithm;embedded system;heterogeneous multiprocessor platform;minimum cost efficient synthesis;processing unit;real-time system;Approximation algorithms;Central Processing Unit;Coprocessors;Cost function;Digital signal processing chips;Embedded software;Graphics;Optimized production technology;Real time systems
application program interfaces;graphics processing units;message passing;parallel architectures;resource allocation;CPU+GPU programming;CUDA;GPU cluster;MPI;OpenMP;graphics processing unit;resource efficiency;stencil computation;Central Processing Unit;Kernel;Message systems;Parallel processing;Programming;Three-dimensional displays;CPU+GPU computing;GPU;MPI;CPU;Stencil
cloud computing;database management systems;field programmable gate arrays;graphics processing units;multiprocessing systems;optimisation;performance evaluation;power aware computing;FPGA;GPU;HARNESS project;SAP HANA database management function;SHEPARD;application performance improvements;architectural principles;associated cost profiles;compute devices;cross resource optimisation;database functionality;energy consumption;enterprise-level software;heterogeneous computing devices;heterogeneous processors;multicore CPU;next generation cloud platforms;nonOpenCL devices;run-time allocation;Databases;Dictionaries;Kernel;Resource management;Runtime;SAP HANA;heterogeneous computation;in-memory database

OpenCL;Heterogeneous systems;Distributed systems;GPU computing;dOpenCL;Multi-cores;Many-cores
GPGPU;Hybrid CPU+GPU System;Load Balancing;Video Coding
application program interfaces;graphics processing units;parallel processing;physics computing;resource allocation;shared memory systems;tree data structures;AFMM parallelization;CPU cores;GPU accelerators;OpenMP task parallelism;adaptive fast multipole method;adaptive spatial decomposition tree;all-pairs computations;cost model;dynamic load balancing;heterogeneous shared memory compute node;heterogeneous systems;incremental adjustment strategy;multicore processors;near-field interactions;serial computation;time-dependent problems;Adaptation models;Computational modeling;Load management;Load modeling;Octrees;CUDA;accelerators;hybrid computing
graphics processing units;iterative methods;multiprocessing systems;processor scheduling;resource allocation;tomography;CPU-GPU;HPC field;adaptive load balancing technique;computing power;dynamic load scheduling;heterogeneous platform;hybrid computing approach;iterative tomographic reconstruction;modern computer;multicore processor;on-demand strategy;workload distribution;Central Processing Unit;Computers;Graphics processing unit;Instruction sets;Multicore processing;Slabs
military aircraft;object-oriented programming;remotely operated vehicles;resource allocation;application requirement;aspect orientation paradigm;computational resource;crosscut concern;dynamic task reconfiguration;hybrid multicore desktop architecture;nonfunctional application timing constraint;reconfigurable computing;run-time reconfigurable load-balancing framework;unmanned aerial vehicle system;Application software;Computer architecture;Graphics;Hardware;Monitoring;Runtime;Surveillance;Timing;Unmanned aerial vehicles;Vehicle dynamics
embedded systems;energy conservation;power aware computing;resource allocation;scheduling;FPGA;baseline greedy scheduler;battery life maximization;computational subsystems;compute resources;dynamic real-time application mapping;dynamic runtime optimization;dynamic task scheduling;energy constraint;energy consumption minimization;energy efficiency;general-purpose processors;graphics processing units;heterogeneous architecture;heterogeneous resources;power consumption;task performance maximization;Computer architecture;Dynamic scheduling;Energy consumption;Optimization;Processor scheduling;Runtime;Schedules
Monte Carlo methods;autoregressive processes;computer graphic equipment;coprocessors;field programmable gate arrays;pricing;scheduling;16 AMD Phenom 9650 quad-core 2.4GHz CPU;8 Tesla C1060 GPU;8 Virtex-5 xc5vlx330t FPGA;GARCH asset simulation;asset simulation;dynamic scheduling Monte-Carlo simulation framework;efficient allocation line;generalized autoregressive conditional heteroskedasticity model;multiaccelerator heterogeneous clusters;option pricing;Computational modeling;Graphics processing unit;Hardware;Kernel;Mathematical model;Processor scheduling
distributed processing;object-oriented programming;aspect-oriented paradigms;computational resources;dynamic tasks self-rescheduling;heterogeneous cluster;heterogeneous execution platform;high-performance computer system;load-balancing;self-adaptive computing;task allocation decisions;Application software;Digital signal processing;Field programmable gate arrays;Hardware;Radar imaging;Runtime;Signal processing algorithms;Timing;Unmanned aerial vehicles;Vehicle dynamics
graphics processing units;multi-threading;multiprocessing systems;tree searching;APU;CPU;GPU cores;accelerated processing unit;breadth-first search;data-parallel execution;heterogeneous processor;hybrid++ BFS algorithm;serial execution;Central Processing Unit;Heuristic algorithms;Instruction sets;Kernel;Parallel processing;Partitioning algorithms;Accelerated Processing Unit (APU);Breadth-first Search (BFS);GPU;Graph Traversal;Graph500;Heterogeneous System Architecture (HSA);Hybrid;OpenCL™
brain;cancer;distributed memory systems;graphics processing units;medical image processing;parallel processing;scheduling;CPU equipped cluster systems;GPU equipped cluster systems;Intel Xeon Phi;MIC equipped cluster systems;Masc;accelerators;brain cancer morphology;distributed memory machine;hierarchical dataflow tasks;high performance computing;microscopy image analysis;pathology image analysis application;performance aware scheduling techniques;task allocation;Central Processing Unit;Image analysis;Microwave integrated circuits;Performance evaluation;Processor scheduling
Tridiagonal solver;Linear algebra;GPU;Accelerator;Heterogeneous execution;Memory management

application program interfaces;computational complexity;graphics processing units;message passing;parallel algorithms;parallel architectures;MPI-CUDA parallel implementation;cluster configuration;hybrid CPU-GPU cluster;hybrid MPI-OpenMP implementation;sequential CPU implementation;single-node CPU-only implementation;single-node GPU-only implementation;subset-sum problem;two-list algorithm;workload distribution scheme;Clustering algorithms;Computational modeling;Educational institutions;Instruction sets;Parallel processing;Performance evaluation;MPI-CUDA implementation;hybrid CPU/GPU cluster;knapsack problem
Heterogeneous processing units;Power-aware design;Processing unit allocation;Real-time systems;Task partitioning

GPU;Heterogeneous multi-cores;OpenMP
graphics processing units;optimisation;parallel processing;resource allocation;scheduling;video coding;CPU architecture;FEVES;GPU architecture;adaptive scheduling;automatic data access management;autonomous unified video encoding framework;collaborative interloop encoding;encoding library;heterogeneous desktop systems;heterogeneous systems;high performance computing;hybrid multicore CPU;interloop modules;load balancing strategies;multiGPU platforms;parallel video encoding;simultaneous execution control;unified optimization problem;video content;Complexity theory;Encoding;Load management;Performance evaluation;Streaming media;GPGPU;Load Balancing
embedded systems;field programmable gate arrays;logic design;multiprocessing systems;parallel processing;performance evaluation;system-on-chip
C++ language;computerised tomography;graphics processing units;image reconstruction;medical image processing;microprocessor chips;Monte Carlo methods;multiprocessing systems;parallel architectures;software libraries
graphics processing units;matrix multiplication;multiprocessing systems;parallel processing;resource allocation;scientific information systems
Agility;FPGA;In-socket accelerator
CPU;GPU;Hybrid computing;CPU–GPU co-processing;High performance computing;Electron tomography;Tomographic reconstruction
Molecular dynamics;GPU;Hybrid parallel computing
Mixed Integer Programming;Kernel mapping;Heterogeneous systems
electrocardiography;multiprocessing systems;resource allocation;scheduling;CPU-GPU PC;GPU performance;LPS algorithm;computer simulation;dynamic scheduling overhead;electrocardiogram;load-prediction scheduling;perfect load balancing;personal supercomputer;pure self-scheduling;Computational modeling;Dynamic scheduling;Graphics processing units;Instruction sets;Processor scheduling;Computer simulation of ECG;GPU
concurrency control;graphics processing units;multi-threading;parallel architectures;performance evaluation;resource allocation;CM-BAL;CM-CPU;CPU performance;GPU core state;GPU interference;GPU performance;GPU-based concurrency management techniques;TLP;general-purpose CPU;heterogeneous architectures;heterogeneous systems;homogeneous architectures;integrated concurrency management strategy;network congestion information;resource utilization;shared hardware resources;shared resource interference minimization;system-wide memory;thread-level parallelism;throughput-optimized GPU;Bandwidth;Central Processing Unit;Computer architecture;Concurrent computing;Resource management;System performance;CPU-GPU;GPUs;concurrency;scheduling

directed graphs;optimisation;processor scheduling;reconfigurable architectures;system-on-chip;directed acyclic graph;hardware-software partitioning;heterogeneous reconfigurable MPSoC;hierarchical task graphs;parallel C mapping;parallel C scheduling;reconfigurable multiprocessor systems-on-chip platforms;task allocation;Ant colony optimization;Dynamic scheduling;Embedded system;Feedback;Field programmable gate arrays;Hardware;Multiprocessing systems;Partitioning algorithms;Runtime;Scheduling algorithm
Monte Carlo methods;graphics processing units;parallel programming;resource allocation;trees (mathematics);Monte Carlo tree search;central processing unit;graphics processing unit;heterogeneous CPU-GPU architecture;high-level design pattern;parallel hardware;parallel program mapping;parallelism;programming practice;programming technique;resource mapping;single core processor;well-specified parallel skeleton;Computer architecture;Convolution;Hardware;Pipelines;Skeleton;Heterogeneous Architecture;Heuristic Algorithm;Montecarlo Tree Search;Static Mapping
object-oriented methods;software engineering
parallel programming;resource allocation;shared memory systems
cloud computing;field programmable gate arrays;quality of experience;regression analysis;resource allocation;scheduling;virtual machines;FPGA accelerators;GPU;QoS;SaaS-level quality-of-service;VM;accelerator resources;business value;cloud environments;computing capacity;computing resource utilization;heterogeneous cloud computing platforms;heterogeneous hardware architecture;heterogeneous private clouds;heterogeneous-aware resource allocation algorithm;heterogeneous-aware resource scheduling algorithm;heuristic algorithm;latency;linear regression model;multicore CPU;power consumption;software-as-a-service model;user requirements;virtual resource management;Catalogs;Hardware;Quality of service;Resource management;Software as a service;FPGAs;Heterogeneous Resources;SaaS
graphics processing units;program compilers;program diagnostics;software portability
Lattice gauge theory;Accelerator;OpenCL;OpenACC
Forward DCT;Heterogeneous computing;Image compression;OpenCL;Parallel image processing;Quantization
driver information systems;embedded systems;image processing;parallel architectures;performance evaluation;system-on-chip;ADAS algorithms;SoC;advanced driver assistance system algorithms;automotive industry;confidence degree;embedded parallel architectures;embedding complex algorithms;execution time interval;heterogeneous architectures;image processing operations;massively-parallel computing unit;optimal performance prediction;processing units;Computational modeling;Computer architecture;Graphics processing units;Kernel;Parallel processing;Prediction algorithms;Performance Prediction
CUDA GPU;Fast Fourier transforms;Parallel and vector implementations;Poisson equations
MPDATA advection algorithm;Stencil computation;GPU accelerators;Hybrid CPU–GPU architectures;Hierarchical decomposition;Autotuning
computer graphic equipment;resource allocation;CPU resources;GPU;entertainment market;graphics processing units;immense computing power;job scheduler;Application software;Central Processing Unit;Computer applications;Computer graphics;High performance computing;Interference;Job design;Performance analysis;Processor scheduling;Resource management
Linux;computer architecture;computer network security;field programmable gate arrays;mobile computing;operating systems (computers);performance evaluation;resource allocation;CPU-based hybrid architecture;FPGA based hardware accelerators;FPGA-based hybrid architecture;Linux OS;hybrid computing platform;mobile applications;operating system;performance-power design space exploration;performance-power implication evaluaton;resource management;robust security;signal processing algorithm;Clocks;Finite impulse response filter;Hardware;Rails;FPGA;Hardware accelearation;Hybrid computing;Performance-power tradeoffs
graphics processing units;multiprocessing systems;resource allocation;CPU processor;GPU processor;NP-hard problem;PLB-HeC algorithm;StarPU framework;bioinformatics;central processing unit;graphics processing unit;heterogeneous CPU-GPU cluster;heterogeneous clusters;linear algebra;machine configurations;profile-based load-balancing algorithm;stock markets;Clustering algorithms;Computational modeling;Heuristic algorithms;Load modeling;Mathematical model;GPGPU;GPU clusters;load-balancing;parallel computing
graphics processing units;multiprocessing systems;power aware computing;power consumption;resource allocation;CPU-GPU heterogeneous systems;DVFS-task mapping coordination;application execution;computer system;cooling system;data-parallel application;device frequency;frequency scaling;load imbalance;near-optimal settings;power budget;power capping technique;power delivery;power management technique;power violation;single computing node;system power consumption;Computational modeling;Equations;Kernel;Mathematical model;Performance evaluation;Power demand;DVFS;GPGPU;Power Capping;Task Mapping
graphics processing units;multiprocessing systems;power aware computing;resource allocation;scheduling;adjustable power state software services;computing cost reduction;computing workload distribution;heterogeneous CPU-GPU architectures;heterogeneous multicore architectures;peak power reduction;power-aware job scheduling;service cost reduction;system cost reduction;Current measurement;Kernel;Power demand;Power measurement;Scheduling algorithms;Power management;multi-GPU;power capping;prediction
integrated circuit design;multiprocessing systems;software prototyping;system-on-chip
aerospace computing;aerospace robotics;mobile robots;remotely operated vehicles;resource allocation;surveillance;UAV-based surveillance system;high-performance computer system platform;real-time task reconfiguration;reconfigurable computing;reconfigurable load-balancing;task allocation;unmanned aerial vehicles;Real time systems
gesture recognition;graphics processing units;parallel processing;resource allocation;robot vision;smart phones;stereo image processing
embedded systems;field programmable gate arrays;multimedia systems;multiprocessing systems;parallel processing;reconfigurable architectures;resource allocation;system-on-chip;FPGA fabric tiles;MPSoC;configuration hierarchy handling;embedded computing platforms;fine-grain reconfigurable hardware processing elements;multimedia applications;multiple heterogeneous processing elements;multiprocessor system-on-chip;programmable softcore;resource assignment decisions;run-time management;run-time task assignment;Decoding;Embedded computing;Fabrics;Hardware;Resource management;Runtime;Configuration hierarchy;Heuristic;multiprocessor system-on-chip (MPSoC);softcore;task assignment
application program interfaces;fault tolerant computing;message passing;resource allocation;GPU;MPI applications;NUMA systems;OS noise;accelerated OpenMP;automated memory transformations;autonomic heterogeneous computing;general purpose computing;graphics processing unit;heterogeneous compute resource management;heterogeneous hardware;hierarchical caching;load balancing;memory access;memory coherence;message passing interface;operating systems;resource contention;runtime adaptation;topology aware affinity management;Acceleration;Graphics processing units;Hardware;Parallel processing;Programming;Runtime;Schedules;OpenMP;scheduling
resource allocation;virtualisation;SAVE approach;heterogeneous system architecture;performance-energy trade-off;processing elements;resource allocation policy;runtime resource management;self-adaptive systems;self-adaptive virtualisation-aware high- performance-low-energy heterogeneous system architecture;system resources;Actuators;Computer architecture;Graphics processing units;Monitoring;Optimization;Resource management;Runtime;Heterogeneous systems;Virtualization
computer vision;edge detection;graphics processing units;microprocessor chips;multiprocessing systems;resource allocation;HSA;computer vision applications;heterogeneous many-core processor;heterogeneous system architecture;on-chip CPU units;on-chip GPU units;processing intervals;resource sharing;resource-aware runtime-system;resource-awareness;self organisation;self-adaptive Harris corner detector;software development;Central Processing Unit;Computer architecture;Detectors;Hardware;Parallel processing;Runtime
graphics processing units;parallel programming;resource allocation;scheduling;source coding;application performance;application programmer;application source code;computational power;hardware accelerators;heterogeneous environments;heterogeneous systems;high performance computers;multiGPU system;resource management;runtime parallelism exploitation;self-adaptive OmpSs tasks;sequential applications;task-based programming model;Computer architecture;Kernel;Programming;Proposals;Reliability;Runtime;heterogeneous architectures;multi-gpu management;parallel programming models;scheduling techniques
multiprocessing systems;parallel processing;processor scheduling;program compilers
Scheduling;Load-balancing;GPUs;Solvers for linear equations systems;Model-Driven Engineering;Aspect Oriented Software Development
Heterogeneous computing;Kernel partitioning;Parallel computing
Linux;concurrency control;data communication;graphics processing units;multiprocessing systems;parallel processing;public domain software;real-time systems;resource allocation;GPGPU;Linux operating system;concurrent Gdev development;data communication mechanisms;data throughput;data transfer;direct I/O schemes;general purpose GPU systems;general purpose parallel computing;heterogeneous computing system;latency-sensitive cyber-physical systems;low-latency CPS;many-core device resource management;multicore CPU systems;open source project;real-time cyber-physical applications;Central Processing Unit;Computer architecture;Graphics processing unit;Kernel;Random access memory;Throughput;GPU communication;real time systems
parallel processing;resource allocation;software architecture;deployment architecture;general allocation model;hardware configurations;heterogeneous distributed processing units;heterogeneous high performance computing;optimization framework;software configurations;software deployment;software-to-hardware allocation framework;two-step allocation algorithm;Computational modeling;Data models;Optimization;Partitioning algorithms;Processor scheduling;Resource management;Software;Allocation;Graph application model;Heterogeneous platform;Mapping
Parallel computation;Parallel frameworks;Parallel architectures;Loop mapping

data flow computing;graphics processing units;linear algebra;mathematics computing;parallel programming;resource allocation;GPU resources;Intel Xeon Phi coprocessors;Intel coprocessors;accelerator hardware;dataflow control;dense linear algebra applications;heterogeneous resources;lightweight runtime environment;mixed multiGPU multicoprocessor environments;multicore-CPU;multiuser environments;parallel execution;resource-specific workload management;serial code;task abstractions;two-way hybrid systems;unified algorithmic development;workloads handling;Coprocessors;Hardware;Multicore processing;Programming;Runtime environment;dense linear algebra;hardware accelerators;runtime scheduling
Coordination languages;Chip multiprocessors;Field-programmable;Logic arrays;Graphics processing units
Benchmark testing;Graphics processing units;Kernel;Resource management;Runtime;Throughput;GPU;Single-chip heterogeneous processor;dynamic voltage and frequency scaling

Heterogeneous, GPU, FPGA, Streaming, Java
Dynamic scheduling, FPGA, GPU, Heterogeneous systems, High-speed x-ray cameras, Linear integration
Heterogeneous programming;task partitioning;OpenCL;parallel programming;static code analysis




Heterogeneous scheduling, load balancing, GPU, OpenCL
Multiprocessor System-on-Chip (MPSoC) Design, Network-on-Chip (NoC), Run-time mapping, Mapping Algorithms
AMD Fusion; graphics processing unit; GPU; GPGPU; accelerated processing unit; APU; OpenCL; performance evaluation; benchmarking; heterogeneous computing
Work Distribution, Power, Heterogeneous System, GPGPU

portability;parallel GPU programming

OpenCL, Clusters, Heterogeneous computing, Programming models
GPU; Multicore; Accelerator; Scheduling; Runtime System

Hybrid architecture for 3D visualization of ultrasonic data
Non-negative Matrix Factorization on Low-Power Architectures and Accelerators: A Comparative Study
An efficient algorithm for molecular dynamics simulation on hybrid CPU-GPU computing platforms
A Software Toolchain for Variability Awareness on Heterogeneous Multicore Platforms
A user mode CPU–GPU scheduling framework for hybrid workloads
Accelerating a Computer Vision Algorithm on a Mobile SoC Using CPU-GPU Co-processing - A Case Study on Face Detection
Dynamic Load Balancing for High-Performance Graph Processing on Hybrid CPU-GPU Platforms
Energy conservation for GPU–CPU architectures with dynamic workload division and frequency scaling
Enhanced Energy Efficiency with the Actor Model on Heterogeneous Architectures
Enhancing Metaheuristic-based Virtual Screening Methods on Massively Parallel and Heterogeneous Systems
Exploiting task and data parallelism for advanced video coding on hybrid CPU + GPU platforms
Hybrid heuristics for mapping task problem on large scale heterogeneous platforms
micMR: An efficient MapReduce framework for CPU–MIC heterogeneous architecture
Optimizing parallel join of column-stores on heterogeneous computing platforms
OSCAR: Orchestrating STT-RAM cache traffic for heterogeneous CPU-GPU architectures
Parallel Graph Partitioning on a CPU-GPU Architecture
Using just-in-time code generation for transparent resource management in heterogeneous systems

With the current proliferation of multi-core processors and hardware acceleration, large-data processing methods are increasingly being adapted for parallel and distributed computing environments. In this paper, we present a hybrid architecture for the visualization and processing of large-scale volumetric data. Various hardware environments and technologies are integrated in this architecture to perform interactive operations on very large volumetric datasets. All of the datasets are stored in a data center with a gigabit network environment. Time-consuming data-processing tasks are allocated to compute nodes connected to the same network, while the visualization and interaction operations are executed on a high-performance graphics workstation. OpenCL and OpenMP are used to implement volume rendering algorithms that accelerate visualization of a hierarchical volume data structure using GPUs and multi-core CPUs. Various out-of-core algorithms are also presented to process the large dataset directly. The experimental results indicate that the proposed hybrid architecture and methods are effective and efficient in processing and visualizing very large volumetric datasets.
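The out-of-core processing the abstract mentions can be illustrated with a brick-by-brick reduction over a memory-mapped volume, so only one brick is resident at a time. This is a minimal sketch under assumed parameters (uint8 voxels, a `brick` edge length, a NumPy reduction `op`), not the paper's implementation:

```python
import numpy as np

def process_out_of_core(volume_path, shape, brick=64, op=np.max):
    """Reduce a large raw volume brick-by-brick through a memory map,
    keeping the resident working set bounded by one brick."""
    vol = np.memmap(volume_path, dtype=np.uint8, mode="r", shape=shape)
    partials = []
    for z in range(0, shape[0], brick):
        for y in range(0, shape[1], brick):
            for x in range(0, shape[2], brick):
                # each slice touches only one brick of the mapped file
                partials.append(op(vol[z:z + brick, y:y + brick, x:x + brick]))
    return op(np.array(partials))
```

The same brick loop is where a real renderer would hand bricks to OpenCL (GPU) or OpenMP (CPU) workers instead of reducing them serially.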
Power consumption is emerging as one of the main concerns in the High Performance Computing (HPC) field. As a growing number of bioinformatics applications require HPC techniques and parallel architectures to meet performance requirements, power consumption arises as an additional limitation when accelerating them. In this paper, we present a comparative study of optimized implementations of Non-negative Matrix Factorization (NMF), which is widely used in many fields of bioinformatics, taking into account both performance and power consumption. We target a wide range of state-of-the-art parallel architectures, including general-purpose and low-power processors as well as specific-purpose accelerators such as GPUs, DSPs and the Intel Xeon Phi. From our study, we gain insights into both performance and energy consumption for each architecture under a number of experimental conditions, and conclude that the most appropriate architecture is usually a trade-off between performance and energy consumption for a given experimental setup and dataset.
In this article, an efficient parallel algorithm for a hybrid CPU-GPU platform is proposed to enable large-scale molecular dynamics (MD) simulations of the metal solidification process. The parallel program implementing this algorithm on the hybrid CPU-GPU platform shows better performance than programs based on previous algorithms running on a CPU cluster platform, and its total execution time is markedly reduced. In particular, owing to the modified load-balancing method, the neighbor-list update time is approximately zero. The parallel program based on the CUDA+OpenMP model achieves a roughly sixfold speedup over the 16-core parallel program based on the MPI+OpenMP model, and optimal computational efficiency is achieved in a simulation system of 10,000,000 aluminum atoms. Finally, the good agreement between theoretical and experimental results verifies the correctness of the algorithm.
Workload allocation in embedded multicore platforms is an increasingly challenging issue due to the heterogeneity of components and their parallelism. Additionally, the impact of process variations in current and next-generation technology nodes is becoming relevant and cannot be compensated at the device or architectural level. Intra-die process variations arising at the core and platform levels make parallel multicore platforms intrinsically heterogeneous, because the various cores are clocked at different operational frequencies. Power consumption becomes heterogeneous too, in both its dynamic and leakage components. In this context, to fully exploit the computational capability of the platform's parallelism, variability-aware task allocation strategies must be adopted. Despite the considerable research into variability-aware task allocation policies, little effort has been devoted to making available to programmers a software toolchain that enables the exploitation of these policies. Such a toolchain needs to exploit fabrication-level information about core clock speed and power consumption. In this work, we present a methodology and the associated toolchain for programming in the presence of process variability, integrating power and performance variability information into all steps of the toolchain. To this purpose, the proposed approach is vertically integrated, from high-level modelling down to runtime management. Variability information is introduced through an XML configuration file that is exploited by toolchain components to make the appropriate runtime allocation decisions. We demonstrate the proposed toolchain using state-of-the-art variability-aware task allocation policies on two multicore platforms: i) the MIPS-based GENEPY simulator with 4 and 8 parallel homogeneous cores and ii) the Tegra2-based Zynq platform, where the on-board FPGA has been used to map 10 MicroBlaze slave cores.
Experiments show that the proposed toolchain supports the integration of variability awareness in a simple yet effective programming environment.
Cloud platforms composed of multi-core CPUs and many-core Graphics Processing Units (GPUs) have become powerful platforms to host incremental CPU–GPU workloads. In this paper, we study the problem of optimizing CPU resource management while keeping the quality of service (QoS) of games. To this end, we propose vHybrid, a lightweight user-mode runtime framework in which we integrate a scheduling algorithm for the GPU and two algorithms for the CPU to efficiently utilize CPU resources with accurate control of QoS. vHybrid can maintain the desired QoS with low CPU utilization, while being able to guarantee better QoS performance with little overhead. Our evaluations show that vHybrid saves 37.29% of CPU utilization with satisfactory QoS for hybrid workloads, and reduces QoS fluctuations by three orders of magnitude, without any impact on GPU workloads.
Recently, mobile devices have become equipped with sophisticated hardware components such as heterogeneous multi-core SoCs that consist of a CPU, GPU, and DSP. This provides opportunities to realize computationally intensive computer vision applications using General-Purpose GPU (GPGPU) programming tools such as the Open Graphics Library for Embedded Systems (OpenGL ES) and the Open Computing Language (OpenCL). As a case study, the aim of this research was to accelerate the Viola-Jones face detection algorithm, which is computationally expensive and limited in use on mobile devices because irregular memory access and imbalanced workloads result in long processing times. To address these challenges, the proposed method adopts CPU-GPU task parallelism, sliding-window parallelism, scale-image parallelism, dynamic allocation of threads, and local memory optimization to improve the computation time. The experimental results show that the proposed method achieved a 3.3-6.29 times speedup compared to the well-optimized OpenCV implementation on a CPU. The proposed method can be adapted to other applications using mobile GPUs and CPUs.
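The sliding-window and scale-image parallelism described above decomposes detection into many independent window evaluations. A minimal sketch of how those work items are enumerated over an image pyramid (illustrative Python; the stride, scale factor and minimum window size are assumptions, not values from the paper):

```python
def sliding_windows(width, height, stride=2, scale=1.25, min_size=24):
    """Enumerate (x, y, size) detection windows over an image pyramid.
    Each tuple is one independent work item that sliding-window and
    scale-image parallelism can distribute across CPU and GPU threads."""
    windows = []
    size = float(min_size)
    while size <= min(width, height):
        s = int(size)
        for y in range(0, height - s + 1, stride):
            for x in range(0, width - s + 1, stride):
                windows.append((x, y, s))
        size *= scale  # next pyramid level
    return windows
```

Because every window can be classified independently, the resulting list can be split freely between devices, which is what makes dynamic thread allocation effective here.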
Graph analysis is becoming increasingly important in many research fields - biology, social sciences, data mining - and daily applications - path finding, product recommendation. Many different large-scale graph-processing systems have been proposed for different platforms. However, little effort has been placed on designing systems for hybrid CPU-GPU platforms. In this work, we present HyGraph, a novel graph-processing system for hybrid platforms which delivers performance by using CPUs and GPUs concurrently. Its core feature is a specialized data structure which enables dynamic scheduling of jobs onto both the CPU and the GPUs, and thus (1) removes the need for static workload distribution, (2) provides load balancing, and (3) minimizes inter-process communication overhead by overlapping computation and communication. Our preliminary results demonstrate that HyGraph outperforms CPU-only and GPU-only solutions, delivering close-to-optimal performance on the hybrid system. Moreover, it supports large-scale graphs which do not fit into GPU memory, and it is competitive against state-of-the-art systems.
In recent years, GPU–CPU heterogeneous architectures have been increasingly adopted in high performance computing, because of their capabilities of providing high computational throughput. However, the energy consumption is a major concern due to the large scale of such kind of systems. There are a few existing efforts that try to lower the energy consumption of GPU–CPU architectures, but they address either GPU or CPU in an isolated manner and thus cannot achieve maximized energy savings. In this paper, we propose GreenGPU, a holistic energy management framework for GPU–CPU heterogeneous architectures. Our solution features a two-tier design. In the first tier, GreenGPU dynamically splits and distributes workloads to GPU and CPU based on the workload characteristics, such that both sides can finish approximately at the same time. We comparatively discuss four dynamic workload allocation algorithms: a Simple Heuristic with fixed step size, an Improved Heuristic with adaptive step size, and two binary search-style algorithms. As a result, the energy wasted on idling and waiting for the slower side to finish is minimized. In the second tier, GreenGPU dynamically throttles the frequencies of GPU cores and memory in a coordinated manner, based on their utilizations, for maximized energy savings with only marginal performance degradation. Likewise, the frequency and voltage of the CPU are scaled similarly. We implement GreenGPU using the CUDA framework on two real hardware testbeds. Experiment results show that GreenGPU achieves 21.04% average energy savings and outperforms several well-designed baselines.
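The first tier's goal, both sides finishing at approximately the same time, can be sketched as an adaptive-step split heuristic in the spirit of the Improved Heuristic above (illustrative Python; the timing callbacks and parameters are assumptions, not GreenGPU's actual code):

```python
def balance_split(time_gpu, time_cpu, ratio=0.5, step=0.25,
                  tol=0.05, max_iter=50):
    """Adjust the GPU share `ratio` until GPU and CPU finish within `tol`
    of each other. `time_gpu(share)` and `time_cpu(share)` are hypothetical
    callbacks predicting (or measuring) each side's time for its share."""
    for _ in range(max_iter):
        tg, tc = time_gpu(ratio), time_cpu(1.0 - ratio)
        if abs(tg - tc) <= tol * max(tg, tc):
            break
        # move work toward the faster side, halving the step each round
        ratio += step if tg < tc else -step
        ratio = min(max(ratio, 0.0), 1.0)
        step /= 2.0
    return ratio
```

Minimizing the finish-time gap is exactly what minimizes the energy wasted on idling while waiting for the slower side.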
Due to rising energy costs, energy-efficient data centers have gained increasingly more attention in research and practice. Optimizations targeting energy efficiency are usually performed on an isolated level, either by producing more efficient hardware, by reducing the number of nodes simultaneously active in a data center, or by applying dynamic voltage and frequency scaling (DVFS). Energy consumption is, however, highly application dependent. We therefore argue that, for best energy efficiency, it is necessary to combine different measures both at the programming and at the runtime level. As there is a tradeoff between execution time and power consumption, we vary both independently to get insights on how they affect the total energy consumption. We choose frequency scaling for lowering the power consumption and heterogeneous processing units for reducing the execution time. While these options have already been shown to be effective in the literature, the lack of energy-efficient software in practice suggests missing incentives for energy-efficient programming. In fact, programming heterogeneous applications is a challenging task, due to the different memory models of the underlying processors and the requirement of using different programming languages for the same tasks. We propose to use the actor model as a basis for efficient and simple programming, and extend it to run seamlessly on either a CPU or a GPU. In a second step, we automatically balance the load between the existing processing units. With heterogeneous actors we are able to save 40–80 % of energy in comparison to CPU-only applications, additionally increasing programmability.
Molecular docking through Virtual Screening is an optimization problem which can be approached with metaheuristic methods. The interaction between two chemical compounds (typically a protein or receptor and small molecule or ligand) is measured with computationally very demanding scoring functions and can, moreover, be measured at several spots throughout the receptor. For the simulation of large molecules, it is necessary to scale to large clusters to deal with memory and computational requirements. In this paper, we analyze the current landscape of computation, where massive parallelism and heterogeneity are today the main ingredients in large-scale computing systems, to enhance metaheuristic-based virtual screening methods, and thus facilitate the analysis of large molecules. We provide a parallelization strategy aimed at leveraging these features. Our solution finds a good workload balance via dynamic assignment of jobs to heterogeneous resources which perform independent metaheuristic executions under different molecular interactions. A cooperative scheduling of jobs optimizes the quality of the solution and the overall performance of the simulation, so opening a new path for further developments of Virtual Screening methods on high-performance contemporary heterogeneous platforms.
Considering the prevalent usage of multimedia applications on commodity computers equipped with both CPU and GPU devices, the possibility of simultaneously exploiting all parallelization capabilities of such hybrid platforms for high-performance video encoding has been highly sought after. Accordingly, a method to concurrently implement the H.264/Advanced Video Coding (AVC) inter-loop on hybrid GPU+CPU platforms is proposed in this manuscript. This method comprises dynamic dependency-aware task distribution and real-time computational load balancing over both the CPU and the GPU, based on efficient dynamic performance modeling. With such an optimal balance, the set of highly optimized parallel algorithms conceived for video coding on both the CPU and the GPU is dynamically instantiated on any of the existing processing devices to minimize the overall encoding time. The proposed model not only provides efficient task scheduling and load balancing for the H.264/AVC inter-loop, but also avoids introducing any significant computational burden into the time-limited video coding application. Furthermore, according to the presented set of experimental results, the proposed scheme provides speedup values as high as 2.5 when compared with highly optimized GPU-only encoding solutions or even other state-of-the-art algorithms. Moreover, by simply using the computational resources that usually equip most commodity computers, the proposed scheme is able to achieve inter-loop encoding rates as high as 40 fps at HD 1920 × 1080 resolution.
The task allocation problem is one of the most studied topics in the field of parallel computing. With the emergence of large-scale platforms, scheduling applications on these large heterogeneous parallel systems is a challenging task due to the large number of mapping possibilities. Indeed, different methods of computing the weights of nodes and edges when scheduling directed acyclic graphs onto heterogeneous platforms may lead to significant variations in the generated schedule; in other words, the problem is ill-posed. In this work, a new task mapping using a hybrid clustering method based on branch and bound is proposed to regularize the problem. The obtained results show that the proposed formulation provides a robust solution in most cases.
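The sensitivity to weight computation mentioned above is visible in the upward-rank computation used by classic list schedulers such as HEFT, where every task's priority depends directly on how mean node costs and edge weights are derived (illustrative Python sketch of the standard rank, not this paper's method):

```python
def upward_rank(tasks, succ, cost, comm):
    """Upward rank of each DAG task: its (mean) execution cost plus the
    heaviest cost-weighted path to an exit task. `succ[t]` lists successors,
    `cost[t]` is the node weight and `comm[(t, s)]` the edge weight --
    precisely the values whose computation method changes the schedule."""
    rank = {}
    def r(t):
        if t not in rank:
            rank[t] = cost[t] + max(
                (comm.get((t, s), 0) + r(s) for s in succ.get(t, [])),
                default=0)
        return rank[t]
    for t in tasks:
        r(t)
    return rank
```

Tasks are then scheduled in decreasing rank order, so two weighting schemes that reorder ranks produce different schedules even on the same DAG.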
With the high-speed development of processors, coprocessor-based MapReduce is widely studied. In this paper, we propose micMR, an efficient MapReduce framework for CPU–MIC heterogeneous architecture. micMR mainly provides the following new features. First, the two-level split and the SIMD friendly map are designed for utilizing the Vector Process Units on MIC. Second, heterogeneous pipelined reduce is developed for improving the efficiency of resource utilization. Third, a memory management scheme is designed for accessing <key, value> pairs in both the host and the MIC memory efficiently. In addition, optimization techniques, including load balancing, SIMD hash, and asynchronous task transfer, are designed for achieving more speedups. We have developed micMR not only in a single node with CPU and MIC but also in a CPU–MIC heterogeneous cluster. The experimental results show that micMR is up to 8.4x and 45.8x faster than Phoenix++, a high-performance MapReduce system for symmetric multiprocessing system, and up to 2.0x and 5.1x faster than Hadoop in a CPU–MIC cluster.
GPUs and integrated multi-core CPU-GPU architectures have powerful parallel processing capabilities and programmable pipelines, and have gradually become a hot area of database research. To fully explore the potential of heterogeneous platforms and enhance the query performance of column-store databases, this paper takes full account of the architectural differences among heterogeneous platforms. First, we propose a data-partition strategy for join operations on multi-threaded platforms, the ICMD algorithm, which uses stream processors to process subspace join operations in parallel. Second, query load is dynamically balanced through a task allocation model, so that query execution proceeds in parallel across the multi-core CPU, GPU and other accelerator components. In addition, the ICMD join algorithm is optimized by means of efficient on-chip global synchronization and local memory reuse. Using the SSB benchmark, the experimental results show that on the Intel HD Graphics 4600 platform, ICMD join queries achieve a 1.35x speedup over the CPU version and an 18% performance improvement over the GPU query engine Ocelot.
As we integrate data-parallel GPUs with general-purpose CPUs on a single chip, the enormous cache traffic generated by GPUs will not only exhaust the limited cache capacity, but also severely interfere with CPU requests. Such heterogeneous multicores pose significant challenges to the design of shared last-level cache (LLC). This problem can be mitigated by replacing SRAM LLC with emerging non-volatile memories like Spin-Transfer Torque RAM (STT-RAM), which provides larger cache capacity and near-zero leakage power. However, without careful design, the slow write operations of STT-RAM may offset the capacity benefit, and the system may still suffer from contention in the shared LLC and on-chip interconnects. While there are cache optimization techniques to alleviate such problems, we reveal that the true potential of STT-RAM LLC may still be limited because now that the cache hit rate has been improved by the increased capacity, the on-chip network can become a performance bottleneck. CPU and GPU packets contend with each other for the shared network bandwidth. Moreover, the mixed-criticality read/write packets to STT-RAM add another layer of complexity to the network resource allocation. Therefore, being aware of the disparate latency tolerance of CPU/GPU applications and the asymmetric read/write latency of STT-RAM, we propose OSCAR to Orchestrate STT-RAM Caches traffic for heterogeneous ARchitectures. Specifically, an integration of asynchronous batch scheduling and priority based allocation for on-chip interconnect is proposed to maximize the potential of STT-RAM based LLC. Simulation results on a 28-GPU and 14-CPU system demonstrate an average of 17.4% performance improvement for CPUs, 10.8% performance improvement for GPUs, and 28.9% LLC energy saving compared to SRAM based LLC design.
Graph partitioning has important applications in multiple areas of computing, including scheduling, social networks, and parallel processing. In recent years, GPUs have proven successful at accelerating several graph algorithms. However, the irregular nature of real-world graphs poses a problem for GPUs, which favor regularity. In this paper, we discuss the design and implementation of a parallel multilevel graph partitioner for a CPU-GPU system. The partitioner aims to overcome some of the challenges arising due to memory constraints on GPUs and maximizes the utilization of GPU threads through suitable load-balancing schemes. We present a lock-free shared-memory scheme, since fine-grained synchronization among thousands of threads imposes too high a performance overhead. The partitioner, implemented in CUDA, outperforms serial Metis and parallel MPI-based ParMetis. It performs similarly to the shared-memory CPU-based parallel graph partitioner mt-metis.
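A core step of a Metis-style multilevel partitioner like the one above is graph coarsening by heavy-edge matching, which is also the step that is hardest to parallelize without fine-grained synchronization. A minimal serial sketch for intuition (illustrative Python, not the paper's lock-free CUDA scheme):

```python
def heavy_edge_matching(adj):
    """One coarsening pass: greedily match each unmatched vertex with its
    heaviest unmatched neighbour. Matched pairs are contracted into a single
    vertex of the coarser graph. `adj[u]` maps neighbour -> edge weight."""
    match = {}
    for u in adj:
        if u in match:
            continue
        best, best_w = None, -1
        for v, w in adj[u].items():
            if v not in match and v != u and w > best_w:
                best, best_w = v, w
        if best is not None:
            match[u], match[best] = best, u
        else:
            match[u] = u  # no free neighbour; vertex survives unmatched
    return match
```

Repeating this pass shrinks the graph until it fits an initial partitioner, after which the partition is projected back and refined level by level.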
Hardware accelerators are becoming popular in academia and industry. To move one step further from the state-of-the-art multicore-plus-accelerator approaches, we present in this paper our innovative SAVEHSA architecture. It comprises a heterogeneous hardware platform with three different high-end accelerators attached over PCIe (GPGPU, FPGA and Intel MIC). Such systems can process parallel workloads very efficiently whilst being more energy efficient than regular CPU systems. To leverage the heterogeneity, the workload has to be distributed among the computing units in a way that each unit is well-suited for the assigned task, and executable code must be available. To tackle this problem we present two software components; the first can perform resource allocation at runtime while respecting system and application goals (in terms of throughput, energy, latency, etc.) and the second is able to analyze an application and generate executable code for an accelerator at runtime. We demonstrate the first proof-of-concept implementation of our framework on the heterogeneous platform, discuss different runtime policies and measure the introduced overheads.
Users of heterogeneous computing systems face two problems: first, understanding the trade-off relationships between the observable characteristics of their applications, such as latency and quality of the result, and second, exploiting knowledge of these characteristics to allocate work to distributed computing platforms efficiently. A domain-specific approach addresses both of these problems. By considering a subset of operations or functions, models of the observable characteristics or domain metrics may be formulated in advance, and populated at run-time for task instances. These metric models can then be used to express the allocation of work as a constrained integer program. These claims are illustrated using the domain of derivatives pricing in computational finance, with the domain metrics of workload latency and pricing accuracy. For a large, varied workload of 128 Black-Scholes and Heston model-based option pricing tasks, running upon a diverse array of 16 multicore CPU, GPU and FPGA platforms, predictions made by models of both the makespan and accuracy are generally within 10 percent of the run-time performance. When these models are used as inputs to machine-learning and MILP-based workload allocation approaches, latency improvements of up to 24 and 270 times over the heuristic approach are seen.
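Expressing the allocation as a constrained integer program, as above, amounts to minimising one domain metric (e.g. makespan) subject to a budget on another (e.g. accuracy). A toy exhaustive version for intuition (illustrative Python; a real workload of 128 tasks needs an ILP/MILP solver, and all names here are assumptions):

```python
from itertools import product

def allocate(tasks, platforms, latency, error, max_error):
    """Exhaustively search task-to-platform assignments, minimising makespan
    subject to a total-error budget. latency[(t, p)] and error[(t, p)] stand
    in for the pre-formulated domain-metric model predictions."""
    best, best_makespan = None, float('inf')
    for assign in product(platforms, repeat=len(tasks)):
        if sum(error[(t, p)] for t, p in zip(tasks, assign)) > max_error:
            continue  # violates the accuracy constraint
        load = {p: 0.0 for p in platforms}
        for t, p in zip(tasks, assign):
            load[p] += latency[(t, p)]
        makespan = max(load.values())
        if makespan < best_makespan:
            best, best_makespan = dict(zip(tasks, assign)), makespan
    return best, best_makespan
```

An MILP formulation replaces the enumeration with binary assignment variables x[t][p] and the same objective and constraint, which is what makes the approach scale.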
Today’s supercomputers are moving towards deployment of many-core processors like the Intel Xeon Phi Knights Landing (KNL) to deliver high compute and memory capacity. Applications executing on such many-core platforms with improved vectorization require high memory bandwidth. To improve performance, architectures like Knights Landing include a high-bandwidth, low-capacity in-package high bandwidth memory (HBM) in addition to the high-capacity but low-bandwidth DDR4. Other architectures like Nvidia’s Pascal GPU also expose similar stacked DRAM. In architectures with heterogeneity in memory types within a node, efficient allocation and data movement can result in improved performance and energy savings in future systems if all data requests are served from the high bandwidth memory. In this paper, we propose a memory-heterogeneity aware runtime system which guides data prefetch and eviction such that data can be accessed at high bandwidth for applications whose entire working set does not fit within the high bandwidth memory and data needs to be moved among different memory types. We implement a data movement mechanism managed by the runtime system which allows applications to run efficiently on architectures with a heterogeneous memory hierarchy, with trivial code changes. We show up to 2x improvement in execution time for Stencil3D and Matrix Multiplication, which are important HPC kernels.
The development of classification techniques constitutes a foundation for the evolution of machine learning, which has become a major part of the current mainstream of Artificial Intelligence research. However, the computational cost associated with these techniques limits their use in resource-constrained embedded platforms. As the classification task is often combined with other computationally expensive functions, efficient performance of the main modules is a fundamental requirement for achieving hard real-time speed for the whole system. Graph-based machine learning techniques offer a powerful framework for building classifiers. The Optimum-Path Forest (OPF) is a graph-based classifier with the interesting ability to provide nonlinear class-separation surfaces. This work proposes a SoC/FPGA-based design and implementation of an architecture for embedded applications, presenting a hardware-converted algorithm for an OPF classifier. Comparison of the achieved results with an embedded-processor software implementation shows accelerations of the OPF classification from 2.18 to 9 times, which suggests that real-time performance can be expected for embedded applications.
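For intuition, OPF classification assigns a test sample the label of the trained node that minimises the path cost max(trained cost, distance), the usual f_max path-cost function; a minimal software sketch (illustrative Python, not the hardware architecture proposed above):

```python
def opf_classify(sample, nodes, dist):
    """Optimum-Path Forest classification step: assign `sample` the label of
    the trained node minimising max(trained_cost, dist(sample, node)), i.e.
    the f_max path-cost function. `nodes` is a list of
    (features, trained_cost, label) tuples produced by OPF training."""
    best_cost, best_label = float('inf'), None
    for feats, cost, label in nodes:
        c = max(cost, dist(sample, feats))
        if c < best_cost:
            best_cost, best_label = c, label
    return best_label
```

The per-node independence of this inner loop is what makes the classifier a natural fit for a parallel hardware implementation.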
In this paper, we present visual analysis techniques to evaluate the performance of HPC task-based applications on hybrid architectures. Our approach is based on composing modern data analysis tools (pjdump, R, ggplot2, plotly), enabling an agile and flexible scripting framework with minor development cost. We validate our proposal by analyzing traces from the full-fledged implementation of the Cholesky decomposition available in the MORSE library running on a hybrid (CPU/GPU) platform. The analysis compares two different workloads and three different task schedulers from the StarPU runtime system. Our analysis based on composite views allows us to identify allocation mistakes, priority problems in scheduling decisions, GPU task anomalies causing bad performance, and critical path issues.
Current heterogeneous platforms with CPUs and accelerators have the ability to launch several independent tasks simultaneously, in order to exploit concurrency among them. These tasks typically consist of data transfer commands and kernel computation commands. In this paper we develop a runtime approach to optimize the concurrency between data transfers and kernel computation commands in a multithreaded scenario where each CPU thread offloads tasks to the accelerator. It deploys a heuristic based on a temporal execution model for concurrent tasks. It is able to establish a near-optimal task execution order that significantly reduces the total execution time, including data transfers. Our approach has been evaluated employing five different benchmarks composed of dominant kernel and dominant transfer real tasks. In these experiments our heuristic achieves speedups up to 1.5x in AMD R9 and NVIDIA K20c accelerators and 1.3x in an Intel Xeon Phi (KNC) device.
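The temporal execution model underlying the heuristic above can be sketched as a two-engine pipeline in which transfers overlap earlier kernels, so reordering tasks changes the total time (illustrative Python; a simplification with one copy engine and one compute engine, not the paper's exact formulation):

```python
def pipelined_makespan(tasks):
    """Makespan of an ordered task list on one copy engine plus one compute
    engine. Each task is (transfer_time, kernel_time): transfers are serial
    on the copy engine, and a kernel starts only after both its own transfer
    and the previous kernel have finished."""
    copy_free, kernel_free = 0.0, 0.0
    for transfer, kernel in tasks:
        copy_free += transfer               # serial copy engine
        start = max(copy_free, kernel_free) # wait for data and compute engine
        kernel_free = start + kernel
    return kernel_free
```

With tasks given as (transfer, kernel) pairs, scheduling the transfer-light, kernel-heavy task first hides more of the copy time: [(1, 3), (2, 1)] finishes at 5 time units versus 6 for the reverse order, which is the kind of gap the paper's ordering heuristic exploits.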
The usage of locating systems in sports elevates match and training analysis to a new level. By tracking players, balls and other sports equipment during matches or training, the performance of players can be analyzed, the training can be adapted and new strategies can be developed. The radio-based RedFIR system equips players and balls in soccer with miniaturized transmitters, while antennas distributed around the playing field receive the transmitted radio signals. A cluster computer processes these signals to determine the exact positions based on the signals’ Time Of Arrival (TOA) at the back end. While such a system works well, it is neither scalable nor inexpensive due to the required computing cluster. The relatively high power consumption of the GPU-based cluster is also suboptimal. Moreover, high-speed interconnects between the antennas and the cluster computers introduce additional costs and increase the installation effort. However, a significant portion of the computing performance is not required for the synthesis of the received data, but for the calculation of the unique TOA values of every receiver line. Therefore, in this paper we propose a smart sensor approach: by integrating some intelligence into the antenna (smart antenna), each antenna correlates the received signal independently of the remaining system, and only a comparably small amount of resulting data is sent to the back end. While the idea is quite simple, the question of a well-suited computer architecture to fulfill this task inside the smart antenna is more complex. Therefore, this paper provides an evaluation of embedded architectures, such as FPGAs, GPUs, ARM cores as well as a many-core CPU (Epiphany), regarding processing performance and energy consumption. Additionally, we show that performance and energy consumption can be improved through heterogeneous computing techniques. Thereby, we are able to achieve the required 50,400 correlations per second in each smart antenna.
As a result, the backend becomes lightweight, cheaper interconnects through data reduction are required and the system becomes more scalable, since most processing power is already integrated in the antenna. In addition, the evaluation results indicate that Software Defined Radio (SDR) approaches in general might benefit from a more diverse application of processing platforms.
Heterogeneous computing is a growing trend in recent computer architecture design and is often used to improve the performance and power efficiency of computing applications by utilizing special-purpose processors or accelerators, such as the Graphics Processing Unit (GPU), Field Programmable Gate Array (FPGA) and Digital Signal Processor (DSP). With the increase in complexity, the interaction among accelerators and processors can fail subtly when a race condition occurs. However, the existing tools for detecting such problems are either too slow or hard to extend with new race-condition detection mechanisms. Therefore, tools for application profiling with an approximate timing model are important to the design of such heterogeneous systems. In this paper, we propose a pluggable GPU interface on an existing timing-approximate CPU simulator based on QEMU for analyzing the memory behavior of heterogeneous systems. By monitoring memory behavior, the pluggable interface can be extended to any kind of accelerator, such as a GPU, DSP or FPGA, for race-condition detection. Taking the GPU as an example, we integrated the detailed GPU simulator from Multi2Sim with the existing timing-approximate CPU simulator, VPA, to showcase the efficiency of the proposed work. The experimental results showed that the emulation speed of the proposed framework can be up to 9× faster than Multi2Sim in some cases. In addition, the race-condition detection mechanism points users to the problematic memory accesses.
Path planning is one of the key functional blocks for autonomous vehicles constantly updating their route in real-time. Heterogeneous many-cores are appealing candidates for its execution, but the high degree of resource sharing results in very unpredictable timing behavior. The predictable execution model (PREM) has the potential to enable the deployment of real-time applications on top of commercial off-the-shelf (COTS) heterogeneous systems by separating compute and memory operations, and scheduling the latter in an interference-free manner. This paper studies PREM applied to a state-of-the-art path planner running on a NVIDIA Tegra X1, providing insight on memory sharing and its impact on performance and predictability. The results show that PREM reduces the execution time variance to near-zero, providing a 3× decrease in the worst case execution time.
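PREM's separation of compute and memory operations can be sketched as follows: memory phases are granted the shared DRAM one task at a time, so each task's timing depends only on the deterministic queue of memory phases before it, while compute phases run concurrently from local memory (illustrative Python; an idealized model, not the paper's implementation):

```python
def prem_schedule(tasks):
    """PREM-style schedule. Each task is (memory_phase, compute_phase):
    memory phases are serialized on the shared DRAM (interference-free),
    and each compute phase then runs from local memory, overlapping with
    the memory phases of later tasks. Returns each task's finish time."""
    dram_free, finishes = 0.0, []
    for mem, comp in tasks:
        dram_free += mem                 # contention-free memory phase
        finishes.append(dram_free + comp)  # compute overlaps later mem phases
    return finishes
```

Because no two memory phases ever overlap, the worst-case finish time of each task is fixed by the schedule alone, which is the source of the near-zero execution-time variance reported above.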
The deployment of real-time workloads on commercial off-the-shelf (COTS) hardware is attractive, as it reduces the cost and time-to-market of new products. Most modern high-end embedded SoCs rely on a heterogeneous design, coupling a general-purpose multi-core CPU to a massively parallel accelerator, typically a programmable GPU, sharing a single global DRAM. However, because of non-predictable hardware arbiters designed to maximize average or peak performance, it is very difficult to provide timing guarantees on such systems. In this work we present our ongoing work on GPUguard, a software technique that predictably arbitrates main memory usage in heterogeneous SoCs. A prototype implementation for the NVIDIA Tegra TX1 SoC shows that GPUguard is able to reduce the adverse effects of memory sharing, while retaining a high throughput on both the CPU and the accelerator.
An accurate and fast human detection is a crucial task for a wide variety of applications such as automotive systems and person identification. The histogram of oriented gradients (HOG) algorithm is one of the most reliable and widely applied algorithms for this task; however, it is also computationally intensive. This paper presents three different implementations using the Zynq SoC, which consists of an ARM processor and an FPGA. The first uses OpenCV functions and runs on the ARM processor; a substantial speedup is achieved through several optimizations implemented in this OpenCV-based HOG approach. The second is a HW/SW co-design implemented on the ARM processor and the FPGA. The third is implemented entirely on the FPGA and optimized for an FPGA implementation to achieve the highest performance for high-resolution images. This implementation achieves 39.6 fps, a further speedup over both the OpenCV-based approach and its optimized variant. The HW/SW co-design likewise achieves a considerable speedup compared to the original HOG implementation running on the ARM processor.
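At the heart of HOG is the per-cell orientation histogram: per-pixel gradients are computed by central differences and their magnitudes are voted into unsigned-orientation bins. A minimal sketch (illustrative Python; real implementations add bilinear vote interpolation and block normalisation, and the bin count is an assumption):

```python
import math

def hog_cell_histogram(cell, bins=9):
    """Orientation histogram for one HOG cell. `cell` is a 2-D list of
    pixel intensities; border pixels are skipped because central
    differences need both neighbours."""
    h, w = len(cell), len(cell[0])
    hist = [0.0] * bins
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]   # horizontal gradient
            gy = cell[y + 1][x] - cell[y - 1][x]   # vertical gradient
            mag = math.hypot(gx, gy)
            ang = math.degrees(math.atan2(gy, gx)) % 180.0  # unsigned
            hist[int(ang / (180.0 / bins)) % bins] += mag
    return hist
```

The regular, data-parallel structure of this loop over cells is what makes HOG such a good fit for the FPGA pipeline described above.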
Heterogeneous systems composed of CPUs and accelerators sharing communication channels of different performance are becoming mainstream in HPC but, at the same time, they exhibit a complexity that makes it difficult to optimize the deployment of a data-parallel application. Recent analytical tools such as Functional Performance Models, combined with advanced partitioning algorithms, manage to achieve a balanced configuration by distributing the workload unevenly, according to the performance of the different processing units. Unfortunately, such an uneven distribution of the computation load leads to communication imbalances that very often render the previous workload-balancing efforts worthless. Finding the optimal communication scheme without expensive testing on the target platform requires an analytical approach to estimating the communication cost of different configurations of the application. With this goal in mind, we propose and discuss an extension of the τ-Lop communication performance model to cover heterogeneous architectures. To provide a quantitative assessment of this extended model, we conduct experiments with two representative computational kernels, the SUMMA algorithm and a 2D wave equation solver. The τ-Lop predictions are compared against the HLogGP model and the observed costs for a variety of configurations, hardware resources and problem sizes.
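To make the idea of analytical communication-cost estimation concrete, here is a deliberately simple latency-plus-bandwidth (Hockney-style) point-to-point estimate. This is far coarser than τ-Lop, which additionally models contention among concurrent transfers on each channel; all parameter values below are hypothetical:

```python
def transfer_cost(m_bytes, latency_s, bandwidth_bps):
    """Hockney-style point-to-point cost: startup latency + serialization."""
    return latency_s + m_bytes / bandwidth_bps

def step_cost(transfers):
    """Transfers over distinct channels proceed in parallel, so one
    communication step finishes when the slowest transfer does."""
    return max(transfer_cost(m, l, b) for (m, l, b) in transfers)

# Uneven partitions send uneven messages over channels of different speed,
# e.g. a GPU's large halo over PCIe vs. a CPU core's small halo through
# shared memory (hypothetical sizes, latencies and bandwidths).
cost = step_cost([(8_000_000, 5e-6, 8e9),    # GPU halo over PCIe
                  (1_000_000, 1e-6, 20e9)])  # CPU halo via shared memory
```

Even this crude model shows why uneven workload distributions shift the bottleneck to the slowest channel, which is the effect the extended τ-Lop model is designed to predict accurately.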
Performance and energy are two major concerns for application development on heterogeneous platforms. It is challenging for application developers to fully exploit the performance/energy potential of heterogeneous platforms. One reason is the lack of reliable prediction of the system’s performance/energy before application implementation. Another is that a heterogeneous platform presents a large design space for workload partitioning between different processors. To reduce such development cost, this article proposes a framework, PeaPaw, to help application developers identify a workload partition (WP) that is likely to lead to high performance or energy efficiency before actual implementation. The PeaPaw framework includes both analytical performance/energy models and two sets of workload partitioning guidelines. Based on the design goal, application developers can obtain a workload partitioning guideline from PeaPaw for a given platform and follow it to design one or more WPs for a given workload. PeaPaw can then be used to estimate the performance/energy of the designed WPs, and the WP with the best estimate can be selected for actual implementation. To demonstrate the effectiveness of PeaPaw, we have conducted three case studies. Their results show that PeaPaw can faithfully estimate the performance/energy relationships of WPs and provide effective workload partitioning guidelines.
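The core of such analytical partitioning guidance can be caricatured in a few lines: if the CPU and GPU execute their shares concurrently, the estimated makespan of a split is the maximum of the two sides' times, and a good split balances them. This is a generic sketch with hypothetical rate and overhead parameters, not the actual PeaPaw model:

```python
def best_partition(work, cpu_rate, gpu_rate, gpu_overhead=0.0, steps=100):
    """Scan candidate GPU fractions beta in [0, 1] and return the one
    minimizing the estimated makespan of a concurrent CPU+GPU run."""
    def makespan(beta):
        t_gpu = gpu_overhead + beta * work / gpu_rate   # GPU side
        t_cpu = (1.0 - beta) * work / cpu_rate          # CPU side
        return max(t_gpu, t_cpu)                        # run concurrently
    return min((i / steps for i in range(steps + 1)), key=makespan)
```

With a GPU four times faster than the CPU and no offload overhead, the balanced split sends 80% of the work to the GPU; a nonzero overhead shifts the optimum back toward the CPU, which is the kind of trade-off an analytical model exposes before any implementation effort.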
The majority of contemporary mobile devices and personal computers are based on heterogeneous computing platforms that consist of a number of CPU cores and one or more Graphics Processing Units (GPUs). Despite the high volume of these devices, there are few existing programming frameworks that target full and simultaneous utilization of all CPU and GPU devices of the platform. This article presents a dataflow-flavored Model of Computation (MoC) that has been developed for deploying signal processing applications to heterogeneous platforms. The presented MoC is dynamic and allows describing applications with data-dependent run-time behavior. On top of the MoC, formal design rules are presented that enable application descriptions to be simultaneously dynamic and decidable. Decidability guarantees compile-time application analyzability for deadlock freedom and bounded memory. The presented MoC and the design rules are realized in a novel open-source programming environment “PRUNE” and demonstrated with representative application examples from the domains of image processing, computer vision and wireless communications. Experimental results show that the proposed approach outperforms the state-of-the-art in analyzability, flexibility and performance.
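The essence of such a dataflow MoC is that an actor may fire only when each of its input FIFOs holds at least its consumption rate of tokens. The following toy scheduler illustrates this firing rule (it is a simplified, dataflow-inspired sketch, not the actual PRUNE API or its scheduling machinery):

```python
from collections import deque

class Actor:
    """Dataflow actor: consumes `rates[i]` tokens from input FIFO i per
    firing; `fn` maps the consumed token lists to output token lists."""
    def __init__(self, ins, outs, rates, fn):
        self.ins, self.outs, self.rates, self.fn = ins, outs, rates, fn

    def can_fire(self):
        return all(len(q) >= r for q, r in zip(self.ins, self.rates))

    def fire(self):
        args = [[q.popleft() for _ in range(r)]
                for q, r in zip(self.ins, self.rates)]
        for q, tokens in zip(self.outs, self.fn(*args)):
            q.extend(tokens)

def run(actors):
    """Fire enabled actors until no actor can fire (FIFOs exhausted)."""
    while any(a.can_fire() for a in actors):
        for a in actors:
            if a.can_fire():
                a.fire()
```

A two-actor pipeline, e.g. a rate-1 "double" actor feeding a rate-2 "sum pairs" actor, runs to completion with this loop; fixed rates make such a graph statically analyzable, while the design rules in the article govern how rates may vary at run time without losing decidability.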
Heterogeneous systems-on-chip (SoCs) that include both general-purpose processors and field-programmable gate arrays (FPGAs) are emerging as very promising platforms for developing modern cyber-physical systems, combining the flexibility typical of software with the speedup achievable by custom hardware accelerators. Furthermore, the dynamic partial reconfiguration (DPR) capabilities of modern FPGAs make such platforms even more attractive, offering the possibility of virtualizing the FPGA area to support several hardware accelerators in time sharing. However, heterogeneous platforms pose considerable challenges in the design and development of applications, especially when timing and energy constraints are concerned. The FRED framework has recently been proposed to support the development of real-time applications on such platforms, using a static slot-based partitioning of the FPGA area to ensure predictable delays when managing custom hardware accelerators via DPR. This paper addresses the problem of designing a suitable FPGA partitioning to support the execution of a real-time application within the FRED framework. The problem is formulated as a mixed-integer linear program that is in charge of (i) sizing the slots (in terms of FPGA resources), (ii) allocating hardware tasks to the slots, and (iii) selecting which hardware tasks must be statically allocated to the FPGA, while ensuring bounded worst-case response times for the tasks.
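The allocation subproblem (ii) can be illustrated in a drastically simplified, hypothetical form that ignores slot sizing, reconfiguration delays and response-time analysis: find the task-to-slot mappings in which each hardware task's resource demand fits within its assigned slot.

```python
from itertools import product

def feasible_allocations(task_areas, slot_sizes):
    """Enumerate every task->slot mapping in which each task's area
    demand fits within the resources of its assigned slot (brute-force
    stand-in for the MILP's binary allocation variables and constraints)."""
    slots = range(len(slot_sizes))
    for assign in product(slots, repeat=len(task_areas)):
        if all(task_areas[t] <= slot_sizes[s]
               for t, s in enumerate(assign)):
            yield assign
```

An ILP solver replaces this exponential enumeration with binary allocation variables and linear fit constraints, and adds the slot-sizing variables and worst-case response-time bounds that the paper's formulation is actually in charge of.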
The memetic agent-based paradigm, which combines evolutionary computation with local search techniques, is one of the promising meta-heuristics for solving large and hard discrete problems such as the low-autocorrelation binary sequence (LABS) problem or the optimal Golomb ruler (OGR) problem. As a follow-up to previous research, this paper briefly introduces the concept of a hybrid agent-based evolutionary systems platform that spreads computation across the CPU and GPU. The main part of the paper presents an efficient parallel GPU implementation of a LABS local optimization strategy. For comparison, speedups of the GPU implementation over sequential and parallel CPU versions are reported. This constitutes a promising step toward building a hybrid platform that combines evolutionary meta-heuristics with highly efficient local optimization of selected discrete problems.
Heterogeneous computing platforms that use GPUs for acceleration are becoming prevalent, so developing parallel applications for GPU platforms and optimizing them for good performance is important. In this work, we develop a set of applications based on a high-level task design, which ensures a well-defined structure that improves portability. Together with the GPU task implementation, we use a uniform interface to allocate and manage memory blocks that are shared by host and device. In this way we can choose the appropriate type of memory for host/device communication easily and flexibly in GPU tasks. Through asynchronous task execution and CUDA streams, we can exploit concurrent GPU kernels for performance improvement when running multiple tasks. We developed a test benchmark set containing nine different kernel applications. The tests show that pinned memory can improve host/device data transfer on GPU platforms, whereas the performance of unified memory varies considerably across GPU architectures and unified memory is not a good choice if performance is the main focus. The multiple-task tests show that applications based on our GPU tasks can effectively exploit the concurrent-kernel capability of modern GPUs for better resource utilization.
