Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
For more information:
http://icl.cs.utk.edu/~pjesa/projects/occ
pjesa -AT- cs.utk.edu
INTRODUCTION
Current parallel programming paradigms for high-performance computing systems rely mainly on message passing, especially on the Message-Passing Interface (MPI) specification. Shared memory concepts (e.g. OpenMP) or parallel programming languages (e.g. UPC, CoArrayFortran) offer a simpler programming paradigm for applications in parallel environments; however, they either lack the scalability to tens of thousands of processors or do not offer a feasible framework for complex, irregular applications. The message-passing paradigm, on the other hand, provides a means to write highly scalable algorithms, abstracting and hiding many architectural decisions from application developers.
Collective operations are an important and frequently used component of the MPI standard. They provide data distribution, collection, and reduction functionality for MPI programs. Previous studies show that the performance of collective communications is critical to high-performance computing [4]. For this reason, numerous algorithms have been developed for most collective operations. The optimal implementation of a collective for a given system depends on many factors, including, but not limited to, physical topology, number of processes involved, message size, and the location of the root node when applicable. Furthermore, many algorithms allow explicit segmentation of the message being transmitted, in which case the performance of the algorithm also depends on the segment size used. Some collective operations involve local computation (e.g. reduction operations), in which case we also need to consider the local characteristics of each node, as they could affect our decision on how to overlap communication with computation.
We developed the Optimized Collective Communication (OCC) library to study the performance of different collective algorithms. OCC is built on top of MPI's point-to-point operations and is therefore independent of the MPI implementation.
OCC consists of a collection of collective algorithms, basic functional verification tools, and a set of micro-benchmarks. The library also provides an interface for user-defined datatypes, so all verification and performance tests can be run with user-defined datatypes without changes to the core library code.
OCC LIBRARY
Currently, OCC consists of different collective methods (Section 2.1), basic functional verification tools, and a set of micro-benchmarks. The library also provides an interface for user-defined datatypes, so all verification and performance tests can be run with user-defined datatypes without changes to the core library code or the performance tests.
The collective algorithms in the OCC library are implemented on top of MPI point-to-point operations, which makes OCC independent of the MPI implementation. This gives us two additional benefits: first, if problems occur, we can evaluate the algorithm using a different MPI implementation to determine whether the bug lies in OCC or in FT-MPI; and second, the OCC performance measurement tools can be used to compare different MPI implementations directly, just like any other MPI collective benchmark.
This is a good time to introduce the terminology we use: an algorithm is an implementation of a particular collective; a particular algorithm can be executed over some virtual topology; and a method is a tuple (collective, algorithm, topology, segment size).
For some collectives we have implemented a single algorithm but can vary the virtual topology and segment size, while for others only the algorithm and segment size are needed to determine collective performance, and the virtual topology is not applicable. Thus, we reduced the method description to a triple (collective, algorithm/topology, segment size).
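The reduced method triple can be pictured as a small record. The following is only a sketch: the type and field names (coll_t, method_t, alg_topo, segment_bytes) are hypothetical and are not taken from the OCC sources, and treating a segment size of 0 as "no segmentation" is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names: OCC's real types are not shown in this manual. */
typedef enum {
    COLL_BARRIER, COLL_BCAST, COLL_SCATTER,
    COLL_REDUCE, COLL_ALLTOALL, COLL_ALLREDUCE
} coll_t;

/* A method as the triple (collective, algorithm/topology, segment size). */
typedef struct {
    coll_t      collective;     /* which collective operation */
    const char *alg_topo;       /* algorithm or topology name */
    size_t      segment_bytes;  /* assumption: 0 means no segmentation */
} method_t;

/* Two methods are the same only if all three components match. */
static int method_equal(const method_t *a, const method_t *b)
{
    assert(a != NULL && b != NULL);
    return a->collective == b->collective
        && strcmp(a->alg_topo, b->alg_topo) == 0
        && a->segment_bytes == b->segment_bytes;
}
```

Under this view, the same algorithm run with two different segment sizes counts as two distinct methods, which is why segment size is swept separately in the tests.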
Currently, this module supports the following collective operations:
MPI_Barrier, MPI_Bcast, MPI_Scatter,
MPI_Reduce, MPI_Alltoall, and MPI_Allreduce.
Currently available topologies are: linear, binomial tree, general tree (including binary), and K-Chain (multiple pipelines).
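As an illustration of a virtual topology, the sketch below computes the parent and children of a rank in a binomial tree rooted at rank 0. It shows one common construction and is not necessarily the one OCC uses; the function names are hypothetical, and a non-zero root would need the ranks shifted relative to the root first.

```c
#include <assert.h>

/* Parent of `rank` in a binomial tree rooted at rank 0; -1 for the root.
 * The parent is obtained by clearing the lowest set bit of the rank. */
static int binomial_parent(int rank)
{
    assert(rank >= 0);
    return rank == 0 ? -1 : (rank & (rank - 1));
}

/* Stores the children of `rank` in `out` and returns how many there are.
 * `out` must have room for one entry per bit of `size` (32 is plenty for int).
 * Children are rank + 2^k for every 2^k below the rank's lowest set bit. */
static int binomial_children(int rank, int size, int *out)
{
    int n = 0;
    assert(rank >= 0 && rank < size);
    for (int mask = 1; mask < size; mask <<= 1) {
        if (rank & mask)
            break;                      /* reached the lowest set bit */
        if (rank + mask < size)
            out[n++] = rank + mask;     /* child exists in this communicator */
    }
    return n;
}
```

With 8 processes, rank 0 has children 1, 2, and 4, so a broadcast over this tree finishes in three rounds.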
All of the methods we describe in subsequent subsections have been modeled using various Parallel Computation Models in [3].
The OCC implementation of MPI_Reduce currently supports only the following native MPI datatypes: MPI_BYTE, MPI_SHORT, MPI_UNSIGNED_SHORT, MPI_INT, MPI_UNSIGNED, MPI_LONG, MPI_UNSIGNED_LONG, MPI_FLOAT, MPI_DOUBLE; and the following native operations: MPI_MAX, MPI_MIN, MPI_SUM, MPI_PROD, MPI_LAND, MPI_BAND, MPI_LOR, MPI_BOR, MPI_LXOR, MPI_BXOR.
We are considering the Reduce algorithm proposed by Rabenseifner in [5], but at this time the algorithm is not included.
In our ``reduced'' method representation these methods are referred to as
(
To use datatype support, the user must go through the process of
creating the datatype, initializing the data buffer, possibly modifying the buffer
using MPI collective operations or manually (not recommended), verifying
the buffer contents, deallocating the data buffer, and freeing the datatype.
Buffer initialization and verification routines take the rank of the ``source''
process and the communicator size in order to initialize and verify the buffer
contents. The value expected in the result buffer depends on the
collective that took place, so Reduce and Allreduce are treated as
a special case.
Currently, OCC provides the following datatypes:
The existing datatype implementations can be used as code examples
for implementing new datatypes.
Simple datatype test programs are defined in
Datatypes/test_occ_data_basic.c and
Datatypes/test_occ_datatype.c.
In general, tests have the following structure:
On a given communicator, run the complete test a specified number of times.
A complete test consists of running the collective-specific test function
for the specified algorithms, segment sizes, and data counts.
Every registered data point is generated from a certain number of
sample points (we usually record the minimum, the average value, and the standard
deviation of the measurement), each of which is in turn the average of a
specified number of measurements.
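The two-level aggregation described above can be sketched as follows. This is an illustration only, not the OCC analysis code: the names are hypothetical, and the sketch reports the population variance (dividing by n), whose square root is the standard deviation; whether OCC divides by n or n-1 is not stated here.

```c
#include <assert.h>
#include <stddef.h>

/* Summary of one data point, built from its sample points. */
typedef struct {
    double min;       /* smallest sample point */
    double mean;      /* average of the sample points */
    double variance;  /* population variance; stddev is its square root */
} data_point_t;

/* A sample point is the average of `n` individual measurements. */
static double sample_point(const double *meas, size_t n)
{
    double sum = 0.0;
    assert(meas != NULL && n > 0);
    for (size_t i = 0; i < n; i++)
        sum += meas[i];
    return sum / (double)n;
}

/* A data point summarizes `n` sample points. */
static data_point_t summarize(const double *samples, size_t n)
{
    data_point_t dp;
    double sq = 0.0;
    assert(samples != NULL && n > 0);
    dp.min = samples[0];
    dp.mean = sample_point(samples, n);
    for (size_t i = 0; i < n; i++) {
        double d = samples[i] - dp.mean;
        if (samples[i] < dp.min)
            dp.min = samples[i];
        sq += d * d;
    }
    dp.variance = sq / (double)n;
    return dp;
}
```

Averaging measurements into sample points first smooths out timer noise, while the spread across sample points still shows how stable the result is.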
The test is initialized from command line arguments. The type of
test is set in the corresponding test's main program. The following
is the list of arguments currently recognized by the test initialization function.
Required arguments are denoted as such.
More information on how to run these tests, with examples, is
provided in the User Manual.
The default verification test parameters that differ from the general test are:
a single test, a single sample point, and a single measurement per sample point.
The output of the verification test goes to stdout and is in the
form: (Algorithm name, Communicator size, Data count, Segment size [bytes],
passed/failed).
In general, collective-specific verification tests consist of the following steps:
Note that data verification must be handled with care: for example,
for Broadcast it is enough to specify the root's rank to verify the
contents of the buffer; for Alltoall, however, we have to check the values
block by block.
Special care was given to the timing of collective operations because of the
issues related to it:
To prevent pipelining effects and measure the time it actually took for
the collective to finish, time is measured only at the root node (if a root
node is not defined, node 0 is selected as root), and after every
collective call a ``report-to-root'' step is introduced to ensure
that all processes have completed their operation. Unfortunately, the
``report-to-root'' step affects the time we measure, especially for small
messages and small numbers of nodes, as the time it takes to perform the
``report-to-root'' operation is comparable to, and sometimes even longer
than, the time to execute the collective. Again, repeating this measurement
multiple times and collecting enough sample points can still drive the
standard deviation of the measurement sufficiently low.
At this time, we do not address caching effects that may occur for
small messages.
The default performance test parameters that differ from the general test are:
a single test, ten sample points, and a number of measurements per sample point that is
determined dynamically.
USER/DEVELOPER MANUAL
Unfortunately, the library does not come with the standard configure,
make, and make install scripts. It does, however, come
with sample makefiles for three different MPI implementations: FT-MPI,
MPICH (1), and MPICH 2. The idea is that the generated libraries will not
have name conflicts, so we can use them all at the same time.
For this reason, we provide a pack target for make,
which archives all source and header files but does not archive
the makefiles. Of course, if we add new files to the library, we
will still have to update the makefiles on the other systems manually.
You should run make pack in the CollOps directory, and then copy
occ_pack.tar to wherever you need it.
We hope that this feature will become obsolete when we add configure
support...
Note that the results of the performance tests are appended to
the existing result file, with a header so that one can distinguish different runs.
The translation was initiated by Jelena Pjesivac-Grbovic on 2005-12-12
MPI_Scatter
The Scatter operation is used to distribute data evenly among the processes
within a group: the send buffer at the root process is split into N equal
parts, and the parts are sent to the corresponding processes.
The following Scatter algorithms are currently available:
Neither of the available Scatter algorithms supports segmentation.
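The data layout behind Scatter can be illustrated without any communication. The sketch below copies the block that a given rank would receive out of the root's send buffer; scatter_block is a hypothetical helper for illustration and is not part of OCC or MPI.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copies the part of the root's send buffer that belongs to `rank`.
 * Each process receives `count` elements of `elem_size` bytes, so the
 * block for `rank` starts at byte offset rank * count * elem_size. */
static void scatter_block(const void *sendbuf, size_t count, size_t elem_size,
                          int rank, void *recvbuf)
{
    const char *src = (const char *)sendbuf;
    assert(sendbuf != NULL && recvbuf != NULL && rank >= 0);
    memcpy(recvbuf, src + (size_t)rank * count * elem_size, count * elem_size);
}
```

For a send buffer holding 0 through 5 and a count of 2, rank 2 would receive the elements 4 and 5.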
MPI_Alltoall
Alltoall is used to exchange data among all processes in the group.
The operation is equivalent to all processes executing the scatter
operation on their local buffer. OCC currently provides the following
Alltoall implementations:
log2(P)
MPI_Allreduce
The Allreduce operation combines the elements provided in the input buffer
of each process in the group using the specified operation, and
the result of the operation is available in the output buffer of all processes.
We built the MPI_Allreduce function on top of OCC's general reduce
and broadcast functions:
Datatype Support
OCC provides datatype support for verification and
performance measurement purposes. For the rest of the OCC
library to be able to use the datatype functionality,
the following functions must be modified, and corresponding functions
must be defined, in
Datatypes/OCC_Datatypes.c:
OCC_Datatype_constructor,
OCC_Datatype_destructor, OCC_Data_initializer,
OCC_Data_verifier,
OCC_Data_remover, OCC_Data_displayer, and
OCC_Data_array_displayer.
Datatypes/occ_data_sstruct.c.
Tests
The tests module supports simple collective
verification functions and simple micro-benchmarks.
At this time, both kinds of test share the same structure, determined by the
OCC_Test_t structure. The library also supplies basic
data analysis tools.
Test structure
The OCC_Test_t structure contains detailed information
about the test that is about to be executed: the type of test, the
name of the collective, the number of test repetitions, the number of
sample points for every data point, the number of measurements for
every sample point (often determined dynamically), the minimum and
maximum data count, the minimum and maximum communicator size on which
we would like to execute tests, the root of the operation (if applicable), the
datatype name (as defined in the Datatypes module), whether
to test the native implementation of the collective, the segment sizes, the algorithms
to be tested, the function names, and the communicator on which we would like
to execute the test.
Method Verification
The method verification tests are defined in OCC_Ver_Test.c.
Performance Measurements
The performance measurement tests are defined in OCC_Perf_Test.c.
The tests are a set of micro-benchmarks for each of the collectives.
The output of the performance measurement test is in the form:
(Algorithm name, Communicator size, Message size [bytes],
Segment size [bytes], minimum duration [sec], average duration [sec],
standard deviation [sec]).
Future work and development
In the future, we plan to investigate the following directions for OCC
development:
TODO list and Development ideas
User/Developer Manual
We assume you obtained the occ_dist.tar.gz file from the
CVS tree, from the website, or via email by contacting
pjesa@cs.utk.edu.
Installation
Initial Installation
The steps to install the library are:
These commands will generate statically linked libraries in
the libs directory and the corresponding executables in
the Tests directory.
mkdir CollOps
tar -xzvf occ_dist.tar.gz
This will create the following directory structure:
Datatypes libs makefile Methods occ_dist.tar Tests
and
make mpich-clean
make mpich
and
make mpich2-clean
make mpich2
respectively.
The MPI distribution name will be appended to the name of the library/file,
for example: liboccmpich.a, vtest_ftmpi,
and ptest_mpich2.
Copying the Latest Library to Another System
Assume we have set up makefiles on two or more systems, and then updated
the library code on one of them.
If we do a plain cvs checkout, or copy over the whole distribution, we will
have to update the makefiles manually every time we do this.
Running Tests
In general, the tests are run in the following manner:
./testname -coll <Collective name>
<[-algs <all | number followed by list>] | [-native]>
[optional arguments]
For more information on the optional arguments, see Section 2.5.
As always, make sure correct startup scripts are used!
Verification
The verification test is defined in Tests/OCC_Ver_Test.c and
the corresponding executable is named vtest_*. Section
2.5 gives a more detailed description of the
verification tests and how they are built. Here, we only provide
a couple of examples of how to run the tests.
Keep in mind that the output of this function goes to stdout.
As long as the lines are flying by you with ``passed'' at the end,
everything is fine.
mpirun -np 8 ./vtest_mpich -coll Bcast -algs all
This is generally a bad idea, since the default values will test
all data points from 0 to 1Kb elements, and then very dense tests up to 131072
elements. In this case, the default datatype is MPI_BYTE.
In general, the test description should be more specific.
mpirun -np 32 ./vtest_mpich2 -coll Bcast -algs all \
    -ddt MPI_DOUBLE -segments 2 0 128 \
    -mincount 100 -maxcount 1000
ftmpirun -o -s -np 16 ./vtest_ftmpi -coll Bcast \
    -algs 2 Splittedbinary Pipeline \
    -ddt sstruct -segments 2 0 128 \
    -mincount 100 -maxcount 1000
Performance Measurements
The performance test is defined in Tests/OCC_Perf_Test.c and
the corresponding executable is named ptest_*. Section
2.5 gives a more detailed description of the
performance tests and how they are built. Here, we only provide
a couple of examples of how to run the tests. Note that there
mpirun -np 8 ./ptest_mpich -coll Alltoall -algs all
This will work but in general, the test description should be more specific.
mpirun -np 32 ./ptest_mpich2 -coll Reduce -algs all \
    -ddt MPI_DOUBLE -segments 2 0 128 \
    -mincount 100 -maxcount 1000
ftmpirun -o -s -np 16 ./ptest_ftmpi -coll Bcast \
    -algs 2 Splittedbinary Pipeline \
    -ddt sstruct -segments 2 0 128 \
    -mincount 100 -maxcount 1000 -numtest 2
Known Bugs
Bibliography
[1] Efficient algorithms for all-to-all communications in multiport message-passing systems. IEEE Transactions on Parallel and Distributed Systems, 8(11):1143-1156, November 1997.
[2] Fault tolerant communication library and applications for high performance computing. In LACSI Symposium, 2003.
[3] J. Pjesivac-Grbovic, T. Angskun, G. Bosilca, G. Fagg, E. Gabriel, and J. Dongarra. Performance analysis of MPI collective operations. In Proceedings of 19th International Parallel and Distributed Processing Symposium, PMEO-PDS Workshop. IEEE Computer Society, April 2005.
[4] Automatic MPI counter profiling of all users: First results on a CRAY T3E 900-512. In Proceedings of the Message Passing Interface Developer's and User's Conference, pages 77-85, 1999.
[5] More efficient reduction algorithms for non-power-of-two number of processors in message-passing parallel systems. In Proceedings of EuroPVM/MPI, Lecture Notes in Computer Science. Springer-Verlag, 2004.
[6] Improving the performance of collective operations in MPICH. In J. Dongarra, D. Laforenza, and S. Orlando, editors, Recent Advances in Parallel Virtual Machine and Message Passing Interface, number 2840 in LNCS, pages 257-267. Springer Verlag, 2003. 10th European PVM/MPI User's Group Meeting, Venice, Italy.
Jelena Pjesivac-Grbovic
2005-12-12