The OpenMP Multi-Threaded Template Library, or OMPTL, provides
transparent parallelization.
With "Dual-Core" and
"HyperThreading"
processors on many desktops, and more to come, current software must be
parallelized to take advantage of the available hardware. Parallelizing
programs is a non-trival task. The OMPTL re-implements part of the Standard
Template Library of C++. The range is partitioned, then the computation is
executed in parallel. The OMPTL uses OpenMP
for parallelization.
The OMPTL requires that compile your programs with an OpenMP-capable compiler,
for example the Intel(C) compiler or GCC 4.2:
icc -I/path/to/omptl/ -openmp myprog.cpp
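With GCC (version 4.2 or later), the corresponding invocation uses the -fopenmp flag, for example:
g++ -I/path/to/omptl/ -fopenmp myprog.cpp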
Contrary to what one might expect, the OMPTL is not at all eager to execute
tasks in parallel. The truth of the matter is that parallelization tends to
introduce overhead and a loss of efficiency, or it may saturate memory bandwidth;
it can therefore even result in a loss of performance. First, in many cases
using the serial version of an algorithm is simply the better choice, a testament
to the excellent quality of the Standard Template Library. Second, even if parts
are executed in parallel, the application will only see a significant speedup if
the parallelized work represents a significant part of the computation required
by your application. Third, each call to an algorithm must be on a
sufficiently large range, rather than successive calls on small ranges. The fourth
restriction is that only calls to the STL's "algorithm" and "numeric" headers are
parallelized, so if your code does not use these, it will not benefit. And the
last bad news: not all algorithms are parallelized yet, and some never will be.
Having said all these bad things, there is no penalty for using the OMPTL, and
changing your code to use it is extremely easy, so you have only
to gain from using it. If your application performs time-consuming operations on
large data, as in image processing, you will definitely be interested.
The OMPTL is available under the
LGPL, the "GNU Lesser General Public License". The license text is
authoritative.
Using OMPTL
OMPTL Programming
Consider the following piece of serial code:
#include <vector>
#include <algorithm>

int main (int argc, char * const argv[])
{
    std::vector<int> v1(100000);
    std::sort(v1.begin(), v1.end());

    return 0;
}
The same example, parallelized with the OMPTL:
#include <vector>
#include <omptl_algorithm>

int main (int argc, char * const argv[])
{
    // Number of threads is derived from the environment
    // variable "OMP_NUM_THREADS"
    std::vector<int> v1(100000);
    omptl::sort(v1.begin(), v1.end());

    return 0;
}
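At run time, the number of threads is controlled through the OMP_NUM_THREADS environment variable, for example (assuming the compiled binary is called myprog):
OMP_NUM_THREADS=4 ./myprog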
Compiling & Linking
The OMPTL will automatically detect if OpenMP can be used; if not, it will
divert all calls to serial code. No code changes are necessary. The OMPTL
is entirely based on templates, so no linking is needed.
Function Details
Algorithm
adjacent_find
Not parallelized.
binary_search
Parallelized, efficiency loss.
copy
Parallelized.
copy_backward
Parallelized.
count
Parallelized.
count_if
Parallelized.
equal
Parallelized.
equal_range
Not (yet) parallelized.
fill
Parallelized.
fill_n
Parallelized.
find
Parallelized, efficiency loss.
find_if
Parallelized, efficiency loss.
find_end
Not (yet) parallelized.
find_first_of
Parallelized, efficiency loss.
for_each
Parallelized.
generate
Not parallelized. The function generate is explicitly
expected to change the state of the generator, so
parallelization is not possible. See the extension par_generate
for a parallel version.
push_heap
Not (yet) parallelized.
pop_heap
Not (yet) parallelized.
make_heap
Not (yet) parallelized.
sort_heap
Not (yet) parallelized.
includes
Parallelized.
lexicographical_compare
Not parallelized.
lower_bound
Parallelized, efficiency loss.
merge
Not parallelized.
min_element
Parallelized.
max_element
Parallelized.
mismatch
Parallelized, efficiency loss.
nth_element
Not (yet) parallelized.
partial_sort
Parallelized.
partial_sort_copy
Not parallelized.
partition
Not parallelized.
stable_partition
Not parallelized.
next_permutation
Not (yet) parallelized.
prev_permutation
Not (yet) parallelized.
random_shuffle
Not (yet) parallelized.
remove
Not parallelized.
remove_copy
Not parallelized.
remove_if
Not parallelized.
remove_copy_if
Not parallelized.
replace
Parallelized.
replace_copy_if
Parallelized.
replace_copy
Parallelized.
replace_if
Parallelized.
reverse
Not (yet) parallelized.
reverse_copy
Not (yet) parallelized.
rotate
Not (yet) parallelized.
rotate_copy
Not (yet) parallelized.
search
Parallelized, efficiency loss.
search_n
Not (yet) parallelized.
set_difference
Not parallelized.
set_intersection
Not parallelized.
set_symmetric_difference
Not parallelized.
set_union
Not parallelized.
sort
Parallelized.
stable_sort
Not (yet) parallelized.
swap_ranges
Parallelized.
transform
Parallelized.
unique
Not parallelized.
unique_copy
Not parallelized.
upper_bound
Parallelized.
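Each parallelized algorithm above keeps the interface of its STL counterpart; only the namespace changes. A minimal sketch using transform (the Double functor is a hypothetical example):

#include <vector>
#include <omptl_algorithm>

// Hypothetical functor that doubles its argument.
struct Double
{
    int operator()(int x) const { return 2 * x; }
};

int main()
{
    std::vector<int> v(100000, 1);
    std::vector<int> w(100000);

    // Used exactly like std::transform; runs in parallel when compiled
    // with an OpenMP-capable compiler, serially otherwise.
    omptl::transform(v.begin(), v.end(), w.begin(), Double());

    return 0;
}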
Extensions to Algorithm
template <class ForwardIterator, class Generator>
void par_generate(ForwardIterator first, ForwardIterator last, Generator gen,
                  const unsigned P = omp_get_max_threads())
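A usage sketch of par_generate, assuming it lives in the omptl namespace next to the other algorithms (the Counter generator is a hypothetical example):

#include <vector>
#include <omptl_algorithm>

// Hypothetical generator: returns successive integers.
struct Counter
{
    int i;
    Counter() : i(0) {}
    int operator()() { return i++; }
};

int main()
{
    std::vector<int> v(100000);

    // Each thread may fill its own part of the range with its own copy
    // of the generator, so the output need not match a serial
    // std::generate call (an assumption about the implementation).
    omptl::par_generate(v.begin(), v.end(), Counter());

    return 0;
}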
Numeric
accumulate
Parallelized for addition and multiplication (see the example after this list).
adjacent_difference
Not (yet) parallelized.
inner_product
Parallelized for addition and multiplication.
partial_sum
Not (yet) parallelized.
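For example, a parallel reduction by addition with accumulate might look as follows (the omptl_numeric header name is an assumption, mirroring omptl_algorithm):

#include <vector>
#include <omptl_numeric> // assumed header name

int main()
{
    std::vector<int> v(100000, 1);

    // Parallel reduction by addition; multiplication works the same way.
    const int sum = omptl::accumulate(v.begin(), v.end(), 0);

    return sum == 100000 ? 0 : 1;
}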
Extensions to Numeric
template <class InputIterator1, class InputIterator2>
typename ::std::iterator_traits<InputIterator1>::value_type
L1(InputIterator1 first1, InputIterator1 last1,
   InputIterator2 first2, const unsigned P = omp_get_max_threads())
"Manhattan" distance between two sets of data.
template <class InputIterator1, class InputIterator2>
typename ::std::iterator_traits<InputIterator1>::value_type
L2(InputIterator1 first1, InputIterator1 last1,
   InputIterator2 first2, const unsigned P = omp_get_max_threads())
"Euclidean" distance between two sets of data.
template <class InputIterator>
typename ::std::iterator_traits<InputIterator>::value_type
L2(InputIterator first, InputIterator last,
   const unsigned P = omp_get_max_threads())
Euclidean vector length.
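A usage sketch of the distance extensions (the omptl_numeric header name is an assumption, mirroring omptl_algorithm):

#include <vector>
#include <omptl_numeric> // assumed header name

int main()
{
    std::vector<double> a(100000, 1.0);
    std::vector<double> b(100000, 3.0);

    // Manhattan distance: sum of absolute differences.
    const double d1 = omptl::L1(a.begin(), a.end(), b.begin());

    // Euclidean distance between a and b, and Euclidean length of a.
    const double d2  = omptl::L2(a.begin(), a.end(), b.begin());
    const double len = omptl::L2(a.begin(), a.end());

    return (d1 > 0.0 && d2 > 0.0 && len > 0.0) ? 0 : 1;
}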
template <class Iterator, class T,
          class UnaryFunction, class BinaryFunction>
T transform_accumulate(Iterator first, Iterator last, T init,
                       UnaryFunction unary_op, BinaryFunction binary_op,
                       const unsigned P = omp_get_max_threads())
A combination of transform and accumulate. Applies
unary_op to every element in the range
[first, last) and accumulates the results using
binary_op.
template <class Iterator, class T, class UnaryFunction>
T transform_accumulate(Iterator first, Iterator last, T init,
                       UnaryFunction unary_op,
                       const unsigned P = omp_get_max_threads())
A combination of transform and accumulate. Applies
unary_op to every element in the range
[first, last) and accumulates by addition.
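A usage sketch of transform_accumulate, computing a sum of squares (the omptl_numeric header name and the Square functor are assumptions):

#include <functional>
#include <vector>
#include <omptl_numeric> // assumed header name

// Hypothetical functor: squares its argument.
struct Square
{
    double operator()(double x) const { return x * x; }
};

int main()
{
    std::vector<double> v(100000, 2.0);

    // Sum of squares, accumulated by addition (single-functor overload).
    const double s1 = omptl::transform_accumulate(v.begin(), v.end(), 0.0,
                                                  Square());

    // Same computation, with the binary operation spelled out.
    const double s2 = omptl::transform_accumulate(v.begin(), v.end(), 0.0,
                                                  Square(), std::plus<double>());

    return (s1 == s2) ? 0 : 1;
}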