- Comparison of MPI, OpenMP, and Stream Processing
=MPI=
MPI is a language-independent communications protocol used to program parallel computers. Both point-to-point and collective communication are supported. MPI "is a message-passing application programmer interface, together with protocol and semantic specifications for how its features must behave in any implementation." [Gropp et al. 96, p.3] MPI is therefore a specification, not an implementation.

MPI is not sanctioned by any major standards body; nevertheless, it has become the de facto standard for communication among processes that model a parallel program running on a distributed memory system. Actual distributed memory supercomputers such as computer clusters often run these programs. The principal MPI-1 model has no shared memory concept, and MPI-2 has only a limited distributed shared memory concept. Nonetheless, MPI programs are regularly run on shared memory computers. Designing programs around the MPI model (as opposed to explicit shared memory models) has advantages on NUMA architectures, because programming for MPI encourages memory locality.

Most MPI implementations consist of a specific set of routines (an API) callable from Fortran, C, or C++ and from any language capable of interfacing with such routine libraries. The advantages of MPI over older message passing libraries are portability (because MPI has been implemented for almost every distributed memory architecture) and speed (because each implementation is in principle optimized for the hardware on which it runs).

MPI is often compared with PVM, a popular distributed environment and message passing system developed in 1989, which was one of the systems that motivated the need for standard parallel message passing systems.
Threaded shared memory programming models (such as Pthreads and OpenMP) and message passing programming (MPI/PVM) can be considered complementary programming approaches.

=OpenMP=
OpenMP is an implementation of multithreading, a method of parallelization whereby the master "thread" (a series of instructions executed consecutively) "forks" a specified number of slave "threads" and a task is divided among them. The threads then run concurrently, with the runtime environment allocating threads to different processors.

The section of code that is meant to run in parallel is marked accordingly, with a preprocessor directive that causes the threads to form before the section is executed. Each thread has an "id" attached to it, which can be obtained using a function (called omp_get_thread_num() in C/C++ and OMP_GET_THREAD_NUM() in Fortran). The thread id is an integer, and the master thread has an id of 0. After the execution of the parallelized code, the threads "join" back into the master thread, which continues onward to the end of the program. The number of threads for execution can be determined either statically (by environment variables) or dynamically (by a function call).

By default, each thread executes the parallelized section of code independently. "Work-sharing constructs" can be used to divide a task among the threads so that each thread executes its allocated part of the code. Both task parallelism and data parallelism can be achieved using OpenMP in this way.

=Stream Processing=
Stream processing is a parallel computer programming paradigm that imposes many additional restrictions on programs, which in turn streamline the hardware. The term comes from the concept of streaming data in and out of an execution core without utilizing inter-thread communication, scattered (i.e., random) writes or even reads, or local memory. Branching is also often disallowed or limited (hence, streaming is also strongly related to SIMD). The paradigm best describes real-time audio/video processing and characterizes early GPUs as well as many DSP efforts. Modern (DX10) GPUs, however, remove many of these limitations and are essentially multithreaded, although they still retain many peculiarities compared to ordinary multi-core CPUs. (Nevertheless, marketing continues to erroneously characterize modern GPGPU programming as "stream processing.")

The stream processing paradigm, in its pure form, is highly efficient. Algorithms that do not require the missing features can be written quickly and run on optimized hardware. The runtime can also automate certain tasks, such as DMA management, thread launching, and resource management. The hardware is drastically simplified and hence can be made much more powerful for the same die area. Stream processing is the means by which specialized audio and video chips were able to process vast amounts of data in real time on workstations and personal computers long before general central processing units could handle the feat.
=Pros and Cons of MPI=
* Pros of MPI
**does not require shared memory architectures, which are more expensive than distributed memory architectures
**can be used on a wider range of problems, since it exploits both task parallelism and data parallelism
**highly portable, with implementations specifically optimized for most hardware
* Cons of MPI
**requires more programming changes to go from a serial to a parallel version
**can be harder to debug
**performance is limited by the communication network between the nodes

=Pros and Cons of OpenMP=
*Pros
**easier to program and debug (compared to MPI)
**data layout and decomposition are handled automatically by directives.
**gradual parallelism: directives can be added incrementally so the program can be parallelized one portion after another and thus no dramatic change to code is needed.
**unified code for both serial and parallel applications: OpenMP constructs are treated as comments when sequential compilers are used.
**original (serial) code statements need not, in general, be modified when parallelized with OpenMP. This reduces the chance of inadvertently introducing bugs and helps maintenance as well.
**both coarse-grained and fine-grained parallelism are possible
*Cons
**currently runs efficiently only on shared-memory multiprocessor platforms
**requires a compiler that supports OpenMP.
**scalability is limited by memory architecture.
**reliable error handling is missing.
**lacks fine-grained mechanisms to control thread-processor mapping.
**synchronization between subsets of threads is not allowed.
**mostly used for loop parallelization

=References=
* Hillis, W. Daniel and Steele, Guy L., Data Parallel Algorithms. Communications of the ACM, December 1986.
* Blelloch, Guy E., Vector Models for Data-Parallel Computing. MIT Press, 1990. ISBN 0-262-02313-X
* Pros and Cons of MPI and OpenMP. http://www.dartmouth.edu/~rc/classes/intro_mpi/parallel_prog_compare.html

=See also=
* SIMD
* Data parallelism
Wikimedia Foundation. 2010.