OMPTL

The OpenMP Multi-Threaded Template Library, or OMPTL, provides transparent parallelization.

With "Dual-Core" and "HyperThreading" processors on many desktops, and more to come, current software must be parallelized to take advantage of the available hardware. Parallelizing programs is a non-trivial task. The OMPTL re-implements part of the C++ Standard Template Library: the input range is partitioned, and the computation on each part is then executed in parallel. The OMPTL uses OpenMP for parallelization.
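
The partition-then-compute idea can be sketched as follows. This is a minimal illustration, not the OMPTL's actual implementation: the helpers partition_range and chunked_fill are hypothetical names invented for this example. The range is split into contiguous chunks, and an OpenMP loop hands each chunk to the serial std::fill.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Split [0, n) into `parts` contiguous, nearly equal half-open chunks.
// Hypothetical helper for illustration; not part of the OMPTL API.
std::vector<std::pair<std::size_t, std::size_t> >
partition_range(std::size_t n, std::size_t parts)
{
	std::vector<std::pair<std::size_t, std::size_t> > chunks;
	if (parts == 0)
		parts = 1;
	const std::size_t base = n / parts;
	const std::size_t extra = n % parts; // first `extra` chunks get one more
	std::size_t begin = 0;
	for (std::size_t i = 0; i < parts; ++i)
	{
		const std::size_t len = base + (i < extra ? 1 : 0);
		chunks.push_back(std::make_pair(begin, begin + len));
		begin += len;
	}
	return chunks;
}

// Set every element to `value`, one chunk per loop iteration.
// With OpenMP enabled the chunks may run concurrently; without it
// the pragma is ignored and the loop runs serially.
void chunked_fill(std::vector<int> &v, int value, std::size_t parts)
{
	const std::vector<std::pair<std::size_t, std::size_t> > chunks =
		partition_range(v.size(), parts);
	const long n_chunks = static_cast<long>(chunks.size());

	#pragma omp parallel for
	for (long i = 0; i < n_chunks; ++i)
		std::fill(v.begin() + chunks[i].first,
			  v.begin() + chunks[i].second, value);
}
```

Because the chunks do not overlap, no two threads write to the same element, which is what makes this kind of algorithm easy to parallelize.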

The OMPTL requires that you compile your programs with an OpenMP-capable compiler, for example the Intel(C) compiler or GCC 4.2 or later:

	icc -I/path/to/omptl/ -openmp myprog.cpp

Contrary to what one might expect, the OMPTL is not at all eager to execute tasks in parallel. First, parallelization tends to introduce overhead and a loss of efficiency, or it may saturate memory bandwidth; it can therefore even result in a loss of performance. In many cases, using a serial version of an algorithm is simply the better choice, a testimony to the excellent quality of the Standard Template Library. Second, even if parts are executed in parallel, the application will only undergo a significant speedup if the parallelized work represents a significant part of the computation required by your application. Third, each call to an algorithm must be on a sufficiently large range, and not successive calls on small ranges. Fourth, only calls to the STL's "algorithm" and "numeric" headers are parallelized, so if your code does not use these, it will not benefit. And the last piece of bad news: not all algorithms are parallelized yet, and some never will be.

Having said all these bad things, there is no penalty for using the OMPTL, and changing your code to use the OMPTL is extremely easy, so you really have only to gain from using it. If your application uses time-consuming operations on large data, such as in image processing, you will definitely be interested.

The OMPTL is available under the LGPL, the GNU Lesser General Public License. The license itself is the authoritative text; what follows is a non-authoritative summary:
  • You are allowed to use the OMPTL library in all your software. Your software does not need to be open-source or free.
  • If you make changes and improvements to the OMPTL itself, you are required to make these changes available.
  • You may not claim or pretend to own the code in whole or in part, violate the copyright, or remove the license notes.
  • Use of this software is entirely at your own risk.

Using OMPTL

OMPTL Programming

Imagine the following piece of code, which is serial:

#include <vector>
#include <algorithm>

int main (int argc, char * const argv[])
{
	std::vector<int> v1(100000);

	std::sort(v1.begin(), v1.end());

	return 0;
}

The same program, parallelized with the OMPTL:

#include <vector>
#include <omptl_algorithm>

int main (int argc, char * const argv[])
{
	// Number of threads is derived from environment
	// variable "OMP_NUM_THREADS"

	std::vector<int> v1(100000);

	omptl::sort(v1.begin(), v1.end());

	return 0;
}

Compiling & Linking

The OMPTL automatically detects whether OpenMP can be used; if not, it diverts all calls to serial code. No code changes are necessary. The OMPTL is entirely based on templates, so no linking is needed.
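
How the OMPTL performs this detection is not described here. One common mechanism, shown below as an assumption rather than as the library's actual code, is the standard _OPENMP preprocessor macro, which OpenMP-enabled compilers define. The functions available_threads and add_to_all are hypothetical names for this sketch.

```cpp
#include <cstddef>
#include <vector>

#ifdef _OPENMP
#include <omp.h> // only available when OpenMP is enabled
#endif

// Number of threads a parallel region may use; 1 when the compiler
// was invoked without OpenMP support.
inline int available_threads()
{
#ifdef _OPENMP
	return omp_get_max_threads();
#else
	return 1;
#endif
}

// Hypothetical algorithm: add `delta` to every element. The same
// source compiles either way, because compilers without OpenMP
// support ignore the pragma and run the loop serially.
inline void add_to_all(std::vector<int> &v, int delta)
{
	const long n = static_cast<long>(v.size());
	#pragma omp parallel for
	for (long i = 0; i < n; ++i)
		v[static_cast<std::size_t>(i)] += delta;
}
```

Since everything lives in headers and is resolved at compile time, such a scheme needs no separate library to link against, apart from whatever OpenMP runtime the compiler flag itself pulls in.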

Function Details

Algorithm

adjacent_find
Not parallelized.
binary_search
Parallelized, efficiency loss.
copy
Parallelized.
copy_backward
Parallelized.
count
Parallelized.
count_if
Parallelized.
equal
Parallelized.
equal_range
Not (yet) parallelized.
fill
Parallelized.
fill_n
Parallelized.
find
Parallelized, efficiency loss.
find_if
Parallelized, efficiency loss.
find_end
Not (yet) parallelized.
find_first_of
Parallelized, efficiency loss.
for_each
Parallelized.
generate
Not parallelized. The function generate is explicitly expected to change the state of the Generator, so parallelization is not possible. See the extension par_generate for a parallel version.
push_heap
Not (yet) parallelized.
pop_heap
Not (yet) parallelized.
make_heap
Not (yet) parallelized.
sort_heap
Not (yet) parallelized.
includes
Parallelized.
lexicographical_compare
Not parallelized.
lower_bound
Parallelized, efficiency loss.
merge
Not parallelized.
min_element
Parallelized.
max_element
Parallelized.
mismatch
Parallelized, efficiency loss.
nth_element
Not (yet) parallelized.
partial_sort
Parallelized.
partial_sort_copy
Not parallelized.
partition
Not parallelized.
stable_partition
Not parallelized.
next_permutation
Not (yet) parallelized.
prev_permutation
Not (yet) parallelized.
random_shuffle
Not (yet) parallelized.
remove
Not parallelized.
remove_copy
Not parallelized.
remove_if
Not parallelized.
remove_copy_if
Not parallelized.
replace
Parallelized.
replace_copy_if
Parallelized.
replace_copy
Parallelized.
replace_if
Parallelized.
reverse
Not (yet) parallelized.
reverse_copy
Not (yet) parallelized.
rotate
Not (yet) parallelized.
rotate_copy
Not (yet) parallelized.
search
Parallelized, efficiency loss.
search_n
Not (yet) parallelized.
set_difference
Not parallelized.
set_intersection
Not parallelized.
set_symmetric_difference
Not parallelized.
set_union
Not parallelized.
sort
Parallelized.
stable_sort
Not (yet) parallelized.
swap_ranges
Parallelized.
transform
Parallelized.
unique
Not parallelized.
unique_copy
Not parallelized.
upper_bound
Parallelized.
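
The list above does not explain the "efficiency loss" noted for the search-type algorithms. One plausible reading, offered here as an assumption, is that a serial search stops at the first match, whereas a chunked parallel search also scans chunks that lie after the match. The hypothetical function chunked_find below illustrates this; it is not the OMPTL's implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Index of the first element equal to `value`, or v.size() if absent.
// Each chunk is searched independently, so chunks located after the
// first match are scanned too: extra work a serial find would avoid.
std::size_t chunked_find(const std::vector<int> &v, int value,
			 std::size_t parts)
{
	if (parts == 0)
		parts = 1;
	const std::size_t n = v.size();
	std::vector<std::size_t> hit(parts, n); // first match per chunk
	const long n_parts = static_cast<long>(parts);

	#pragma omp parallel for
	for (long p = 0; p < n_parts; ++p)
	{
		const std::size_t up = static_cast<std::size_t>(p);
		std::vector<int>::const_iterator first =
			v.begin() + n * up / parts;
		std::vector<int>::const_iterator last =
			v.begin() + n * (up + 1) / parts;
		std::vector<int>::const_iterator it =
			std::find(first, last, value);
		if (it != last)
			hit[up] = static_cast<std::size_t>(it - v.begin());
	}

	// The earliest match over all chunks is the overall result.
	return *std::min_element(hit.begin(), hit.end());
}
```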

Extensions to Algorithm

template <class ForwardIterator, class Generator>
void par_generate(ForwardIterator first, ForwardIterator last, Generator gen,
		  const unsigned P = omp_get_max_threads())
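
The exact semantics of par_generate are not documented here. The sketch below shows one possible approach, stated as an assumption: every chunk works on its own copy of the generator, so no state is shared between threads. The names Counter and chunked_generate are hypothetical and exist only for this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// A stateful generator: each call returns the next integer.
struct Counter
{
	int next;
	explicit Counter(int start) : next(start) {}
	int operator()() { return next++; }
};

// Hypothetical chunked generate: every chunk gets a private copy of
// `gen`, so the copies never share state. Not OMPTL's implementation.
template <class Generator>
void chunked_generate(std::vector<int> &v, Generator gen,
		      std::size_t parts)
{
	if (parts == 0)
		parts = 1;
	const std::size_t n = v.size();
	const long n_parts = static_cast<long>(parts);

	#pragma omp parallel for
	for (long p = 0; p < n_parts; ++p)
	{
		const std::size_t up = static_cast<std::size_t>(p);
		Generator local = gen; // private copy for this chunk
		std::generate(v.begin() + n * up / parts,
			      v.begin() + n * (up + 1) / parts, local);
	}
}
```

With a stateful generator such as Counter, the result differs from the serial std::generate: each chunk restarts from the generator's initial state. A scheme like this is therefore only a drop-in replacement when the generator's output does not depend on how often it was called before.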

Numeric

accumulate
Parallelized for addition and multiplication.
adjacent_difference
Not (yet) parallelized.
inner_product
Parallelized for addition and multiplication.
partial_sum
Not (yet) parallelized.
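
Addition and multiplication are associative and commutative, so per-thread partial results can be combined in any order; an arbitrary user-supplied operation offers no such guarantee, which is presumably why only these two are parallelized. The sketch below uses OpenMP's reduction clause to show the idea for addition. The function parallel_sum is a hypothetical name and not part of the OMPTL.

```cpp
#include <cstddef>
#include <vector>

// Sum of all elements plus `init`. Each thread accumulates a private
// partial sum; OpenMP adds the partial sums together at the end.
long parallel_sum(const std::vector<int> &v, long init)
{
	long sum = 0;
	const long n = static_cast<long>(v.size());

	#pragma omp parallel for reduction(+:sum)
	for (long i = 0; i < n; ++i)
		sum += v[static_cast<std::size_t>(i)];

	return init + sum;
}
```

For floating-point data the order of summation affects rounding, so a parallel sum may differ slightly from the serial result.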

Extensions to Numeric

template <class InputIterator1, class InputIterator2>
typename ::std::iterator_traits<InputIterator1>::value_type
L1(InputIterator1 first1, InputIterator1 last1,
   InputIterator2 first2, const unsigned P = omp_get_max_threads())
"Manhattan" distance between two sets of data.
template <class InputIterator1, class InputIterator2>
typename ::std::iterator_traits<InputIterator1>::value_type
L2(InputIterator1 first1, InputIterator1 last1,
   InputIterator2 first2, const unsigned P = omp_get_max_threads())
"Euclidean" distance between two sets of data.
template <class InputIterator>
typename ::std::iterator_traits<InputIterator>::value_type
L2(InputIterator first, InputIterator last,
   const unsigned P = omp_get_max_threads())
Euclidean vector length.
template <class Iterator, class T,
	   class UnaryFunction, class BinaryFunction>
T transform_accumulate(Iterator first, Iterator last, T init,
		UnaryFunction unary_op, BinaryFunction binary_op,
		const unsigned P = omp_get_max_threads())
A combination of transform and accumulate. Applies unary_op to every element in the range [first, last), and accumulates the results using binary_op.
template <class Iterator, class T, class UnaryFunction>
T transform_accumulate(Iterator first, Iterator last, T init,
			UnaryFunction unary_op,
			const unsigned P = omp_get_max_threads())
A combination of transform and accumulate. Applies unary_op to every element in the range [first, last), and accumulates the results by addition.
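
As a reading aid, the serial functions below sketch what these extensions compute. They are reference versions written for this example under the serial_ prefix, not the OMPTL code; in particular, taking the square root in the Euclidean length is an assumption based on the usual definition.

```cpp
#include <cmath>
#include <vector>

// Serial reference: apply unary_op to each element, fold with binary_op.
template <class Iterator, class T, class UnaryFunction, class BinaryFunction>
T serial_transform_accumulate(Iterator first, Iterator last, T init,
			      UnaryFunction unary_op,
			      BinaryFunction binary_op)
{
	for (; first != last; ++first)
		init = binary_op(init, unary_op(*first));
	return init;
}

inline double square(double x) { return x * x; }
inline double add(double a, double b) { return a + b; }

// Euclidean vector length: square root of the sum of squares.
inline double serial_L2(const std::vector<double> &v)
{
	return std::sqrt(serial_transform_accumulate(
		v.begin(), v.end(), 0.0, square, add));
}

// "Manhattan" distance: sum of absolute element-wise differences.
template <class InputIterator1, class InputIterator2>
double serial_L1(InputIterator1 first1, InputIterator1 last1,
		 InputIterator2 first2)
{
	double d = 0.0;
	for (; first1 != last1; ++first1, ++first2)
		d += std::fabs(static_cast<double>(*first1) -
			       static_cast<double>(*first2));
	return d;
}
```

For the vector (3, 4), serial_L2 returns 5; for the ranges (1, 2, 3) and (4, 0, 3), serial_L1 returns 3 + 2 + 0 = 5.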