OCC: Optimized Collective Communication
User Manual

December 12, 2005
Innovative Computing Laboratory,
Computer Science Department,
University of Tennessee, Knoxville

Copyright © 1998-2004, University of Tennessee / Team Harness All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

CREDITS
Project Leader Project Supervisor Developers

For more information:  
http://icl.cs.utk.edu/~pjesa/projects/occ  
pjesa -AT- cs.utk.edu  


HARNESS, HARNESS G_HCORE and FT-MPI were funded in part by the U.S. Department of Energy.




Introduction



MPI Collective Operations

Today, end-users and application developers of high performance computing systems have access to larger machines and more processors than ever before. Systems such as the Earth Simulator, the ASCI-Q machines, or the IBM Blue Gene consist of thousands or even tens of thousands of processors. Machines comprising 100,000 processors are expected within the next few years.

Current parallel programming paradigms for high-performance computing systems rely mainly on message passing, especially on the Message-Passing Interface (MPI) specification. Shared memory concepts (e.g. OpenMP) or parallel programming languages (e.g. UPC, CoArrayFortran) offer a simpler programming paradigm for applications in parallel environments; however, they either lack the scalability to tens of thousands of processors or do not offer a feasible framework for complex, irregular applications. The message-passing paradigm, on the other hand, provides a means to write highly scalable algorithms, abstracting and hiding many architectural decisions from the application developers.

Collective operations are an important and frequently used component of the MPI standard. They provide data distribution, collection, and reduction functionality for MPI programs. Previous studies show that the performance of collective communication is critical to high-performance computing [4]. For this reason, numerous algorithms have been developed for most collective operations. The optimal implementation of a collective for a given system depends on many factors, including but not limited to the physical topology, the number of processes involved, the message size, and the location of the root node when applicable. Furthermore, many algorithms allow explicit segmentation of the message being transmitted, in which case the performance of the algorithm also depends on the segment size used. Some collective operations involve local computation (e.g. reduction operations), in which case we also need to consider the local characteristics of each node, as they could affect our decision on how to overlap communication with computation.

In order to study the performance of different collective algorithms, we developed the Optimized Collective Communication (OCC) library. OCC consists of a collection of different collective algorithms, basic verification tools, and a set of micro-benchmarks.


Optimized Collective Communication Library

We developed the Optimized Collective Communication (OCC) library in order to study the performance of different collective algorithms. OCC is built on top of MPI's point-to-point operations, and as such it is independent of the MPI implementation.

OCC consists of different collective algorithms, basic functional verification tools, and a set of micro-benchmarks. The library also provides an interface for user-defined datatype support, so all verification and performance tests can be run with user-defined datatypes without changes to the core library code.


Optimized Collective Communication Library



Introduction

The OCC library is an FT-MPI [2] spin-off project which allows us to quickly implement various collective operations. The goal of the library is to provide a framework to quickly implement a collective algorithm, verify its correctness, and evaluate its performance. When a new algorithm proves to be useful, we add it to the FT-MPI collective communication subsystem, using additional optimizations and FT-MPI specific functions to improve its performance.

Currently, OCC consists of different collective methods (see footnote 2.1), basic functional verification tools, and a set of micro-benchmarks. The library also provides an interface for user-defined datatype support, so all verification and performance tests can be run with user-defined datatypes without changes to the core library code or the performance tests.

The collective algorithms in the OCC library are implemented on top of MPI point-to-point operations, which makes OCC independent of the MPI implementation. This gives us two additional benefits: first, if problems occur, we can evaluate an algorithm using a different MPI implementation to determine whether the bug is in OCC or in FT-MPI; and second, the OCC performance measurement tools can be used to compare different MPI implementations directly, just like any other MPI collective benchmark.


OCC Structure

The library consists of the following modules:

This is a good time to introduce the terminology we use: an algorithm is an implementation of a particular collective, a particular algorithm can be executed over some virtual topology, and a method is a tuple (collective, algorithm, topology, segment size).

For some collectives we have implemented a single algorithm but can vary the virtual topology and segment size, while for others only the algorithm and segment size are needed to determine collective performance, and the virtual topology is not applicable. Thus, we reduce the method description to a triple (collective, algorithm/topology, segment size).

OCC Naming Policy

The OCC library tries to conform to the following naming policy:


Supported Collectives

The methods module contains various implementations of different collective operations as well as the decision function interpretation functionality.

Currently, this module supports the following collective operations:
MPI_Barrier, MPI_Bcast, MPI_Scatter, MPI_Reduce, MPI_Alltoall, and MPI_Allreduce.

Currently available topologies are: linear, binomial tree, general tree (including binary), and K-Chain (multiple pipelines).

All of the methods we describe in subsequent subsections have been modeled using various Parallel Computation Models in [3].

MPI_Barrier

Barrier is a collective operation used to synchronize a group of nodes. OCC currently supports four different Barrier functions:
Linear
This function implements Barrier using a linear broadcast followed by a linear gather operation. The process with rank 0 in the communicator is selected as the root of the operation.
Dring
This function implements Barrier using the double ring algorithm: a process receives a message from the rank on its left and passes it to the rank on its right. A process may leave the barrier once it receives the message for the second time.
Recursive doubling (Recdbl)
This algorithm is intended for Barrier on communicators whose size, $P$, is an exact power of 2. In this case, the algorithm takes $\log_2(P)$ steps to complete: at every step $step$, process $r$ exchanges a zero-byte message with process $r + 2^{step}$, with wraparound. When $P$ is not an exact power of 2, the algorithm takes $\lfloor \log_2(P) \rfloor + 2$ steps. This algorithm was introduced in [6].
Bruck
This algorithm is a variation of the Index algorithm by Bruck described in [1] and [6]. It takes $\lceil \log_2(P) \rceil$ steps to complete: at every step $step$, process $r$ sends a zero-byte message to process $r + 2^{step}$ and receives a message from process $r - 2^{step}$, with wraparound.
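The communication pattern of the Bruck barrier can be checked without MPI. The following is a minimal simulation sketch, not OCC code: the function name `bruck_barrier` and the "set of ranks heard from" bookkeeping are assumptions made for illustration. It confirms that after $\lceil \log_2(P) \rceil$ steps every process has, directly or transitively, heard from every other process.

```python
def bruck_barrier(P):
    """Simulate the Bruck barrier pattern on P processes.

    known[r] is the set of ranks whose arrival process r has learned about.
    Returns (number of steps, final known sets).
    """
    known = [{r} for r in range(P)]
    step = 0
    while (1 << step) < P:
        dist = 1 << step
        # All messages of one step are sent concurrently, so read a snapshot.
        snapshot = [set(s) for s in known]
        for r in range(P):
            dst = (r + dist) % P  # r sends to r + 2^step, with wraparound
            known[dst] |= snapshot[r]
        step += 1
    return step, known


if __name__ == "__main__":
    for P in (2, 5, 8, 13):
        steps, known = bruck_barrier(P)
        complete = all(k == set(range(P)) for k in known)
        print(P, steps, complete)
```

A process may leave the barrier only once its set is complete, which is why the step count grows with the logarithm of the communicator size rather than with the size itself.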

MPI_Bcast

The Broadcast operation broadcasts a message from the root process to all processes of the group. At the end of the call, the contents of the root's communication buffer have been copied to all other processes. OCC supports two algorithms for the MPI_Bcast operation:
Generalized Broadcast
This algorithm supports various virtual topologies and message segmentation. In the algorithm, we first build the required tree structure (Linear (Flat tree), Binomial, Binary (General tree), or (multiple) Pipeline), such that parent/children relations are established among the processes. Then we proceed in the following way:
In our "reduced" method representation these methods are referred to as (Bcast, topology, segment size).

Splitted Binary Broadcast
This algorithm uses the Binary tree topology and message segmentation. The algorithm consists of two phases: a scattering phase and an exchange phase. At the beginning, we build the binary tree topology, and the root splits the message into two halves: left and right. The scattering phase is implemented in the following way:
In the exchange phase, every process finds its "pair" in the opposite subtree and exchanges the corresponding half of the message, so that at the end of the phase all processes have the complete message. When the number of processes is even, the last node exchanges its message with the root.
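The Generalized Broadcast starts by establishing parent/children relations for the chosen virtual topology. As an illustration, here is a sketch of how those relations can be computed for a binomial tree. It uses one common construction (the lowest set bit of the root-relative rank); the function name `binomial_tree` is hypothetical, and OCC's own topology-building functions may order or shape the tree differently.

```python
def binomial_tree(rank, size, root=0):
    """Return (parent, children) of `rank` in a binomial tree rooted at `root`.

    Ranks are first shifted so that the root becomes virtual rank 0.
    """
    vrank = (rank - root) % size
    # Find the lowest set bit of vrank; for the root, run past `size`.
    mask = 1
    while mask < size and not (vrank & mask):
        mask <<= 1
    parent = None if vrank == 0 else (vrank - mask + root) % size
    # Children are vrank + m for every power of two m below that bit.
    children = []
    mask >>= 1
    while mask > 0:
        if vrank + mask < size:
            children.append((vrank + mask + root) % size)
        mask >>= 1
    return parent, children


if __name__ == "__main__":
    for r in range(8):
        print(r, binomial_tree(r, 8))
```

With relations like these in place, a tree-based broadcast only needs each process to receive from its parent and forward to its children, which is what lets the same code run over different topologies.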

MPI_Reduce

The Reduce operation combines the elements provided in the input buffer of each process in the group using the specified operation, and returns the combined value in the output buffer of the root process. The following Reduce algorithms are currently available:
Generalized Reduce
This algorithm supports various virtual topologies and message segmentation. In the algorithm, we first build the required tree structure (Linear (Flat tree), Binomial, Binary (General tree), or (multiple) Pipeline), such that parent/children relations are established among the processes. Then we proceed in the following way:

The OCC implementation of MPI_Reduce currently supports only the following native MPI datatypes: MPI_BYTE, MPI_SHORT, MPI_UNSIGNED_SHORT, MPI_INT, MPI_UNSIGNED, MPI_LONG, MPI_UNSIGNED_LONG, MPI_FLOAT, MPI_DOUBLE; and the following native operations: MPI_MAX, MPI_MIN, MPI_SUM, MPI_PROD, MPI_LAND, MPI_BAND, MPI_LOR, MPI_BOR, MPI_LXOR, MPI_BXOR.

We are considering whether to include the Reduce algorithm proposed by Rabenseifner in [5], but at this time the algorithm is not included.

In our "reduced" method representation these methods are referred to as (Reduce, topology, segment size).
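Message segmentation, which the generalized Broadcast and Reduce algorithms support, amounts to cutting a buffer of `count` elements into pieces of at most one segment size. The sketch below is an assumption about how such a split might be computed, not the OCC implementation: the function name is hypothetical, the segment size is taken in bytes with 0 meaning "no segmentation", and segments are rounded down to whole elements (with a minimum of one element).

```python
def split_into_segments(count, type_size, segment_bytes):
    """Return a list of (first_element, element_count) pairs.

    count         -- number of elements in the message
    type_size     -- size of one element in bytes
    segment_bytes -- requested segment size in bytes; <= 0 disables segmentation
    """
    if segment_bytes <= 0:
        return [(0, count)]
    # Never split an element: use whole elements per segment, at least one.
    per_segment = max(1, segment_bytes // type_size)
    return [(start, min(per_segment, count - start))
            for start in range(0, count, per_segment)]


if __name__ == "__main__":
    # 1000 doubles (8 bytes each) in 128-byte segments
    segments = split_into_segments(1000, 8, 128)
    print(len(segments), segments[0], segments[-1])
```

Each segment can then be forwarded (or reduced) as soon as it arrives, which is what allows tree and pipeline topologies to overlap the transfer of one segment with the processing of the next.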

MPI_Scatter

The Scatter operation is used to distribute data evenly among the processes within a group: the send buffer at the root process is split into N equal parts and the parts are sent to the corresponding processes. The following Scatter algorithms are currently available:
Linear
In this algorithm, the root sends the appropriate message directly to every child node.
Binomial Broadcast
This algorithm is used for scattering small messages using a binomial broadcast tree.
Binary Scatter
This algorithm is used for scattering small messages using the binary tree topology and the splitted broadcast approach: at every step, the parent takes its own data, splits the remaining message into two contiguous chunks, and sends them down the tree. To ensure that we are dealing with contiguous messages, the root performs a data shift. At every level, when a child receives a message, the first data block in that message belongs to it, and the rest of the data will be distributed to its children. For this function we do not use the topology building functions but compute children and parent ranks on the fly.
None of the available Scatter algorithms supports segmentation.

MPI_Alltoall

Alltoall is used to exchange data among all processes in the group. The operation is equivalent to all processes executing the scatter operation on their local buffer. OCC currently provides the following Alltoall implementations:
Linear
This is the simplest Alltoall implementation: every process posts non-blocking receives and sends for all necessary data blocks. We try to order the requests such that the minimum number of conflicts occurs.
Pairwise exchange
This algorithm takes $P$ communication steps on a communicator of size $P$. At every step, process $r$ receives a data block from process $(r - step)$ and sends the necessary data block to rank $(r + step)$, with wraparound.
Bruck
This algorithm is described in [1]. It is intended for small message sizes, as it decreases the latency of the Alltoall operation at the cost of increased bandwidth requirements. It takes $\lceil \log_2(P) \rceil$ steps. It consists of three phases:
  1. local shift - the send buffer on every process is shifted up by as many blocks as the process rank.
  2. communication - at every step $k$, process $r$ sends all blocks whose $k$th bit is 1 to process $r + 2^k$, and receives the same blocks from process $r - 2^k$, with wraparound.
  3. local unshift - the opposite of the first step.
Ring
This algorithm is intended for a linear physical network topology. It is similar to the Bruck algorithm, except that on a communicator of size $P$ it takes $P$ communication steps. On every communication step $k$, processes exchange all new data with their left and right neighbors.
Scheduled Linear
This algorithm is similar to the Linear algorithm, except that in this case processes start sending messages in order.
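The three phases of the Bruck Alltoall are easier to follow on a small in-memory simulation. The sketch below is an illustration under stated assumptions, not OCC code: all "processes" live in one Python list, the function name `bruck_alltoall` is hypothetical, "the $k$th bit" is read as bit $k$ of the block's index in the shifted buffer, and the final unshift is written in the form used in [1] (block $i$ of the temporary buffer ends up at receive position $(r - i) \bmod P$).

```python
def bruck_alltoall(send):
    """Simulate Bruck Alltoall. send[r][d] is the block rank r sends to rank d.

    Returns (recv, steps) where recv[q][s] is the block rank q got from rank s.
    """
    P = len(send)
    # Phase 1, local shift: rotate each send buffer up by `rank` blocks, so
    # position i holds the block that must travel i ranks forward.
    tmp = [[send[r][(r + i) % P] for i in range(P)] for r in range(P)]

    # Phase 2, communication: at step k, forward every block whose index has
    # bit k set to rank r + 2^k; the receiver stores it at the same index.
    steps = 0
    while (1 << steps) < P:
        dist = 1 << steps
        nxt = [row[:] for row in tmp]
        for r in range(P):
            dst = (r + dist) % P
            for i in range(P):
                if i & dist:
                    nxt[dst][i] = tmp[r][i]
        tmp = nxt
        steps += 1

    # Phase 3, local unshift: the block at index i came from rank (r - i).
    recv = [[None] * P for _ in range(P)]
    for r in range(P):
        for i in range(P):
            recv[r][(r - i) % P] = tmp[r][i]
    return recv, steps


if __name__ == "__main__":
    P = 5
    send = [[(src, dst) for dst in range(P)] for src in range(P)]
    recv, steps = bruck_alltoall(send)
    print(steps, recv[2])
```

A block at index $i$ is forwarded once for every set bit of $i$, so it travels exactly $i$ ranks in total; this is also why the algorithm moves more data than the pairwise exchange while needing fewer steps.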

MPI_Allreduce

The Allreduce operation combines the elements provided in the input buffer of each process in the group using the specified operation, and the result of the operation is available in the output buffer of all processes. We built the MPI_Allreduce function on top of OCC's general reduce and broadcast functions:
General Allreduce
This algorithm executes a reduce to process 0, followed by a broadcast of the result from process 0. All available virtual topologies can be used to implement this function. This algorithm also supports segmentation of the messages.


Datatype Support

OCC provides datatype support for verification and performance measurement purposes. In order for the rest of the OCC library to be able to make use of the datatype functionality, the following functions must be modified, and the corresponding functions must be defined, in
Datatypes/OCC_Datatypes.c: OCC_Datatype_constructor,
OCC_Datatype_destructor, OCC_Data_initializer, OCC_Data_verifier,
OCC_Data_remover, OCC_Data_displayer, and OCC_Data_array_displayer.

To use datatype support, the user must go through the process of creating the datatype, initializing the data buffer, possibly modifying the buffer using MPI collective operations or manually (not recommended), verifying the buffer contents, deallocating the data buffer, and freeing the datatype.

Buffer initialization and verification routines take the rank of the "source" process and the communicator size in order to initialize and verify buffer contents. The value expected in the result buffer depends on the collective that took place, so Reduce and Allreduce are analyzed as a special case.

Currently, the OCC provides following datatypes:

Basic
The basic datatype support covers the basic MPI datatypes: MPI_BYTE, MPI_CHAR, MPI_INT, MPI_FLOAT, MPI_DOUBLE. The user refers to these datatypes by their name with the MPI prefix. The implementation of this datatype can be found in Datatypes/occ_data_basic.c.
Simple struct
This datatype represents a simple structure consisting of a single character byte and one double. The type size and extent of this datatype differ, so it is especially useful for finding errors which may occur due to segmentation. The user refers to this datatype as sstruct. The implementation of this datatype can be found in
Datatypes/occ_data_sstruct.c.
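The gap between type size and extent for a char-plus-double structure can be seen without MPI. The sketch below uses Python's `struct` module as a stand-in for the C compiler's layout rules; it is only an illustration of why such a type stresses segmentation code, and the exact native layout is platform dependent.

```python
import struct

# Bytes of actual data: one char (1 byte) plus one double (8 bytes),
# laid out with no padding ("=" selects standard sizes, no alignment).
type_size = struct.calcsize("=cd")

# Bytes spanned in memory when the double is placed at its native
# alignment ("@" selects native sizes and alignment). On many 64-bit
# platforms this is larger than type_size because of padding after the char.
native_span = struct.calcsize("@cd")

# A segment size given in bytes holds a different number of elements
# depending on which of the two numbers a computation uses.
segment_bytes = 128
elements_by_size = segment_bytes // type_size
elements_by_span = segment_bytes // native_span

if __name__ == "__main__":
    print(type_size, native_span, elements_by_size, elements_by_span)
```

Code that computes buffer offsets from the type size where the extent is required (or the reverse) will work for the basic datatypes, where the two agree, and fail for a type like this one.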

The existing datatype implementations can be used as code examples for implementing new datatypes. Simple datatype test programs are defined in Datatypes/test_occ_data_basic.c and Datatypes/test_occ_datatype.c.


Tests

The tests module supplies simple collective verification functions and simple micro-benchmarks. At this time, both kinds of tests share the same structure, determined by the OCC_Test_t structure. The library also supplies basic data analysis tools.

In general, tests follow this structure: on a given communicator, run the complete test a specified number of times. A complete test consists of running the collective-specific test function for the specified algorithms, segment sizes, and data counts. Every registered data point is generated from a certain number of sample points (we usually record the minimum, the average value, and the standard deviation of the measurement), each of which is in turn the average of a specified number of measurements.
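The relation between measurements, sample points, and data points described above can be sketched in a few lines. This is an illustration only: the function name `data_point` is hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator OCC uses.

```python
import statistics


def data_point(measurements_per_sample):
    """Reduce raw timings to one data point.

    measurements_per_sample -- list of lists; each inner list holds the
    individual measurements that make up one sample point.
    """
    # Each sample point is the average of its measurements.
    samples = [statistics.fmean(m) for m in measurements_per_sample]
    # The data point records minimum, average, and standard deviation
    # over the sample points (population form, an assumption).
    return {
        "min": min(samples),
        "mean": statistics.fmean(samples),
        "stdev": statistics.pstdev(samples),
    }


if __name__ == "__main__":
    timings = [[1.0, 3.0], [2.0, 4.0], [3.0, 5.0]]  # 3 samples x 2 measurements
    print(data_point(timings))
```

Averaging several measurements into one sample point compensates for limited clock resolution, while the spread across sample points indicates how stable the measurement is.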

Test structure

The OCC_Test_t structure contains detailed information about the test that is about to be executed: the type of test, the name of the collective, the number of test repetitions, the number of sample points for every data point, the number of measurements for every sample point (often determined dynamically), the minimum and maximum data count, the minimum and maximum communicator size on which we would like to execute tests, the root of the operation (if applicable), the datatype name (as defined in the Datatypes module), whether to test the native implementation of the collective, the segment sizes, the algorithms to be tested, the function names, and the communicator on which we would like to execute the test.

The test is initialized from command line arguments. The type of the test is set in the corresponding test's main program. The following is the list of arguments currently recognized by test initialization function. Required arguments are denoted as such.

-coll <name of collective> - name of the collective to be tested without MPI_ prefix! [Required]

-algs [<all> | <number> <list of algorithms>] - which algorithms to test. all means all available, or user can specify number of algorithms they plan to test and follow it up by name of the algorithm as defined in Section 2.3 (for example: Linear, Pipeline, Splittedbinary, Recdbl, etc.). [Required if -native flag is not set]

-native - flag specifying whether to test native collective implementation (when present - we will test it) [Required if -algs are not specified]

-segments <number of segment sizes> [segment sizes] - specifies how many and which segment sizes to use in tests. 0 means no segmentation, less than 0 means use default segment sizes. Default is no segmentation.

-datatype <datatype name> | -ddt <datatype name> - the datatype to be used in the test, as defined in Section 2.4. Default values depend on collective.

-f <result file name> - name of the output file. Default file name is perf_Collective. Results are appended to the file.

-mincount <number> - minimum data count (not the size!). Default value is 1.
-maxcount <number> - maximum data count (not the size!). Default value is 131072 (128K elements).

-mincsize <number> - minimum communicator size for the test. Default is 2.

-maxcsize <number> - maximum communicator size for the test. Default is the size of MPI_COMM_WORLD.

-root <number> or -r <number> - rank of the root. Default is 0. This is probably the best value if you plan to use more than one communicator size.

-numtest <number> - number of test repetitions. Default is 1.

-numsample <number> - number of sample points for every data point. Default depends on test type.

-nummeasure <number> - number of measurements for every sample point. This value is overridden by the dynamic value computed at run time; however, it is used as the first input.

More information on, and examples of, how to run these tests are provided in the User/Developer Manual chapter.

Method Verification

The method verification tests are defined in OCC_Ver_Test.c.

The default verification test parameters that differ from the general test are: a single test, a single sample point, and a single measurement per sample point. The output of the verification test goes to stdout and has the form: (Algorithm name, Communicator size, Data count, Segment size [bytes], passed/failed).

In general, collective-specific verification tests consist of the following steps:

  1. Obtain the correct function pointers for the specified datatype from the Datatypes module.
  2. Commit the datatype.
  3. For element counts between mincount and maxcount (the next data count is determined dynamically)
  4. Free the datatype using the function from the Datatypes module.

Note that data verification must be handled with care: for example, for Broadcast it is enough to specify the root's rank to verify the contents of the buffer; for Alltoall, however, we have to check the values block by block.
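The difference between verifying a Broadcast and verifying an Alltoall can be made concrete with a deterministic fill pattern. The sketch below is not the OCC verifier: the function names and the pattern itself are hypothetical, chosen only so that the expected contents can be recomputed from the source rank.

```python
def init_buffer(source_rank, length):
    """Hypothetical initializer: each value depends on source rank and index."""
    return [(source_rank * 31 + i) % 251 for i in range(length)]


def verify_bcast(buf, root):
    """After Broadcast the whole buffer must match what the root initialized."""
    return buf == init_buffer(root, len(buf))


def verify_alltoall(buf, my_rank, comm_size, count):
    """After Alltoall, block `src` must equal block `my_rank` of src's buffer."""
    for src in range(comm_size):
        sent = init_buffer(src, comm_size * count)
        expected = sent[my_rank * count:(my_rank + 1) * count]
        if buf[src * count:(src + 1) * count] != expected:
            return False
    return True


if __name__ == "__main__":
    P, count = 4, 3
    sends = [init_buffer(s, P * count) for s in range(P)]
    # Build what rank 1 would hold after a correct Alltoall.
    recv1 = [x for s in range(P) for x in sends[s][1 * count:2 * count]]
    print(verify_alltoall(recv1, 1, P, count), verify_bcast(sends[2], 2))
```

The Broadcast check needs a single source rank, while the Alltoall check has to regenerate the expected data once per block, using a different source rank each time.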

Performance Measurements

The performance measurement tests are defined in OCC_Perf_Test.c. The tests are a set of micro-benchmarks for each of the collectives. The output of the performance measurement test has the form:
(Algorithm name, Communicator size, Message size [bytes], Segment size [bytes], minimum duration [µsec], average duration [µsec], standard deviation [µsec], maximum "report-to-root" time for that measurement).

Special care was given to the timing of collective operations because of the following issues:

  • The resolution of the system clock can be insufficient to correctly measure the time it took for a collective operation to complete. The solution is to repeat the collective operation a number of times.
  • A collective operation can take dramatically different times on different processes. Consider a non-segmented binary tree broadcast: the root is done as soon as it sends two messages. If we have 127 nodes, the tree depth will be 7 levels - the root will be done even before the 3rd level receives the message!
  • Related to the previous issue are pipelining effects. Pipelining effects may occur because we can start a new collective operation as soon as we are done locally (ignoring the fact that the previous collective operation may not be completed yet).
  • Caching effects when we work with small messages.

To prevent pipelining effects, and to actually measure the time it took for the collective to finish, time is measured only at the root node (if a root node is not defined, node 0 is selected to be the root), and after every collective call a "report-to-root" step is introduced to ensure that all processes have completed their operation. Unfortunately, the "report-to-root" step affects the time we measure, especially for small messages and small numbers of nodes, as the time it takes to perform the "report-to-root" operation is comparable to, and sometimes even longer than, the time to execute the collective. Again, repeating this measurement multiple times and collecting enough sample points can still drive the standard deviation of the measurement sufficiently low.
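Why timing at the root alone is misleading can be illustrated with a toy cost model. The sketch below is not a measurement and not OCC code: it assumes a heap-ordered binary tree (children of rank r are 2r+1 and 2r+2), blocking sends issued one after another, and a hypothetical unit cost per send.

```python
def binary_tree_bcast_times(size, send_cost=1.0):
    """Return (arrival, done) time lists for a non-segmented binary tree broadcast.

    arrival[r] -- time at which rank r holds the message
    done[r]    -- time at which rank r has finished all of its sends
    """
    arrival = [0.0] * size
    done = [0.0] * size
    for r in range(size):  # heap order visits parents before their children
        t = arrival[r]
        for child in (2 * r + 1, 2 * r + 2):
            if child < size:
                t += send_cost      # sends to the two children are sequential
                arrival[child] = t
        done[r] = t
    return arrival, done


if __name__ == "__main__":
    arrival, done = binary_tree_bcast_times(127)
    print("root finished at", done[0])
    print("last process finished at", max(done))
```

Under this model, with 127 processes the root finishes after two send times while the last leaf receives the message after twelve, so a timer stopped when the root returns would report only a fraction of the real completion time. The "report-to-root" step closes that gap at the cost of adding its own overhead to every measurement.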

At this time, we do not address caching effects that may occur for small messages.

The default performance test parameters that differ from the general test are: a single test, ten sample points, and a dynamically determined number of measurements per sample point.


Future work and development

In the future, we plan to investigate the following directions for OCC development:
  • Increasing number of available methods in OCC.
  • Additional user-defined datatype support. The flexibility OCC has with datatypes can help us perform application-level optimizations by measuring the performance using custom datatypes.
  • Work on an automatic collective tuning process. Currently, the performance measurement results are analyzed off-line by Matlab programs. This information is then used to manually generate the decision function for FT-MPI collective operations. We are currently working on replacing the manual process of parsing the generated decision function and turning it into code blocks with an automatic one (using quad-tree algorithms).

TODO list and Development ideas

  • Find and remove all bugs :-)
  • Maybe we should have occ_init() and occ_finalize() functions to deal with MPI_COMM_WORLD attributes and things like that. A few bytes here and there are still leaked bytes.
  • Check if it is possible to combine all collective dependent test functions into same function instead of test function per collective.
  • Create configure, make, make install tools...


User/Developer Manual



Installation

We assume you obtained occ_dist.tar.gz file either from the CVS tree, the website, or via email by contacting pjesa@cs.utk.edu.

Unfortunately, the library does not come with standard configure, make, and make install scripts. It does, however, come with sample makefiles for three different MPI implementations: FT-MPI, MPICH (1), and MPICH 2. The idea is that the generated libraries will not have name conflicts, so we can use them all at the same time.

Initial Installation

The steps to install the library are:
  1. Create the directory where you would like to install the library:
    mkdir CollOps
  2. Unpack the archive in the directory created in step 1:
    tar -xzvf occ_dist.tar.gz
    This will create the following directory structure:
    Datatypes libs makefile Methods occ_dist.tar Tests
  3. Go to the Methods, Datatypes, and Tests directories and fix Makefiles_* such that they use correct values for mpicc, mpirun, and similar scripts. Do not forget to set the correct value for OCCPATH! Go back to the CollOps directory and modify makefile to reflect the changes in supported MPI implementations.
  4. To build a copy of the library and test programs for FT-MPI, MPICH, and MPICH 2, run the following in the CollOps directory:
    make ftmpi-clean
    make ftmpi
    and
    make mpich-clean
    make mpich
    and
    make mpich2-clean
    make mpich2
    respectively.
These commands will generate statically linked libraries in the libs directory and the corresponding executables in the Tests directory.
The MPI distribution name will be appended to the name of the library/file, for example: liboccmpich.a, vtest_ftmpi, and ptest_mpich2.

Copying the Latest Library to Another System

Assume we have set up makefiles on two or more systems, and then updated the library code on one of them. If we do a plain cvs checkout, or copy over the whole distribution, we will have to update the makefiles manually every time.

For this reason, we provide a pack target for make, which archives all source and header files but not the makefiles. Of course, if we add new files to the library, we will still have to update the Makefiles on the other system manually.

Run make pack in the CollOps directory, and then copy occ_pack.tar to wherever you need it.

We hope that this feature will become obsolete when we add configure support...


Running Tests

In general, the tests are run in the following manner:
./testname -coll <Collective name>
<[-algs <all | number followed by list>] | [-native]>
[optional arguments]

For more information on the optional arguments, see Section 2.5. As always, make sure the correct startup scripts are used!

Verification

The verification test is defined in Tests/OCC_Ver_Test.c and the corresponding executable is named vtest_*. Section 2.5 gives a more detailed description of the verification tests and how they are built. Here, we only provide a couple of examples of how to run the tests. Keep in mind that the output of this test goes to stdout. As long as the lines flying by end with "passed", everything is fine.
Verifying all available Broadcast algorithms with default values:
mpirun -np 8 ./vtest_mpich -coll Bcast -algs all
This is generally a bad idea, since the default values will test all data points from 0 to 1K elements, and then very dense tests up to 131072 elements. In this case, the default datatype is MPI_BYTE. In general, the test description should be more specific.
Verifying all available Reduce algorithms with 2 segment sizes (no segmentation and segments of size 128 bytes), datatype double, minimum count 100, and maximum count 1000:
mpirun -np 32 ./vtest_mpich2 -coll Reduce -algs all \
-ddt MPI_DOUBLE -segments 2 0 128 \
-mincount 100 -maxcount 1000
Verifying the Splittedbinary and Pipeline Broadcast algorithms with 3 segment sizes (64, 256, and 1024 bytes), the sstruct datatype, minimum count 100, and maximum count 200:
ftmpirun -o -s -np 16 ./vtest_ftmpi -coll Bcast \
-algs 2 Splittedbinary Pipeline \
-ddt sstruct -segments 3 64 256 1024 \
-mincount 100 -maxcount 200

Performance Measurements

The performance test is defined in Tests/OCC_Perf_Test.c and the corresponding executable is named ptest_*. Section 2.5 gives a more detailed description of the performance tests and how they are built. Here, we only provide a couple of examples of how to run the tests.

Note that the results of the performance tests are appended to the existing result file, with a header so that one can distinguish different runs.

Benchmarking all available Alltoall algorithms without segmentation:
mpirun -np 8 ./ptest_mpich -coll Alltoall -algs all
This will work, but in general the test description should be more specific.
Benchmarking all available Reduce algorithms with 2 segment sizes (no segmentation and segments of size 128 bytes), datatype double, minimum count 100, and maximum count 1000:
mpirun -np 32 ./ptest_mpich2 -coll Reduce -algs all \
-ddt MPI_DOUBLE -segments 2 0 128 \
-mincount 100 -maxcount 1000
Benchmarking the Splittedbinary and Pipeline Broadcast algorithms with 3 segment sizes (64, 256, and 1024 bytes), the sstruct datatype, minimum count 100, and maximum count 200, executing two complete tests:
ftmpirun -o -s -np 16 ./ptest_ftmpi -coll Bcast \
-algs 2 Splittedbinary Pipeline \
-ddt sstruct -segments 3 64 256 1024 \
-mincount 100 -maxcount 200 -numtest 2


Known Bugs

None at this time
Please report them!

Bibliography

1
J. Bruck, C.-T. Ho, S. Kipnis, E. Upfal, and D. Weathersby.
Efficient algorithms for all-to-all communications in multiport message-passing systems.
IEEE Transactions on Parallel and Distributed Systems, 8(11):1143-1156, November 1997.

2
G. E. Fagg, E. Gabriel, Z. Chen, T. Angskun, G. Bosilca, A. Bukovsky, and J. J. Dongarra.
Fault tolerant communication library and applications for high performance computing.
In LACSI Symposium, 2003.

3
J. Pješivac-Grbović, T. Angskun, G. Bosilca, G. Fagg, E. Gabriel, and J. Dongarra.
Performance analysis of MPI collective operations.
In Proceedings of 19th International Parallel and Distributed Processing Symposium, PMEO-PDS Workshop. IEEE Computer Society, April 2005.

4
R. Rabenseifner.
Automatic MPI counter profiling of all users: First results on a CRAY T3E 900-512.
In Proceedings of the Message Passing Interface Developer's and User's Conference, pages 77-85, 1999.

5
R. Rabenseifner and J. L. Träff.
More efficient reduction algorithms for non-power-of-two number of processors in message-passing parallel systems.
In Proceedings of EuroPVM/MPI, Lecture Notes in Computer Science. Springer-Verlag, 2004.

6
R. Thakur and W. Gropp.
Improving the performance of collective operations in MPICH.
In J. Dongarra, D. Laforenza, and S. Orlando, editors, Recent Advances in Parallel Virtual Machine and Message Passing Interface, number 2840 in LNCS, pages 257-267. Springer Verlag, 2003.
10th European PVM/MPI User's Group Meeting, Venice, Italy.


About this document ...

The translation was initiated by Jelena Pjesivac-Grbovic on 2005-12-12


Footnotes

2.1
Section 2.2 describes the difference between a method and an algorithm in detail.
Jelena Pjesivac-Grbovic 2005-12-12