PARA'04 State-of-the-Art
in Scientific Computing
June 20-23, 2004 (Home page)

Updated: 10 February 2004

Applying parallel direct solver skills to build robust and highly performant preconditioners

Pascal Hénon, Pierre Ramet, François Pellegrini and Jean Roman
ScAlApplix Project
INRIA Futurs, France
and
Yousef Saad
University of Minnesota, USA
emails: {roman,ramet,henon,pelegrin}@labri.fr

Solving large sparse linear systems by iterative methods has often been quite unsatisfactory when dealing with pratical "industrial" problems. The main difficulty encountered by such methods is their lack of robustness and, generally, the unpredictability and unconsistency of their performance over a wide sample of different problems; certain methods work quite well for certain types of problems but can fail completely on other problems.

Over the past few years, direct methods have made significant progress thanks to either the combinatorial analysis of the Gaussian elimination process and the parallel algorithmic of blockwise solvers optimized for modern parallel supercomputers. Its is now possible to solve practical three-dimensional problems in the order of several millions of equations in a very powerful way with the direct solvers that efficiently use the superscalar effects of modern processors and take all the benefits from the computer architectures based on networks of SMP nodes (IBM SP, DEC-Compaq, for example). However, direct methods may fail to solve very large three-dimensional problems, due to the large amount of memory needed for these cases.

In our work, we consider an approach which, we hope, will bridge the gap between the two classes of methods. The goal is to provide a method which exploits the parallel blockwise algorithmic used in the framework of the high performance sparse direct solvers for developping robust parallel incomplete factorization based preconditioners for iterative solvers. The idea is then to define an adaptive blockwise incomplete factorization that is much more accurate (and numerically more robust) than the scalar incomplete factorizations commonly used to precondition iterative solvers. Such incomplete factorization can take advantage of the latest breakthroughts in sparse direct methods and therefore be very competitive in CPU time (effective power used from processors) while avoiding the memory limitation encountered by direct methods. By this way, we expect to be able to solve systems in the order of hundred millions of unknowns. Another goal is to analyse and justify the chosen parameters that can be used to define the block sparse pattern in our incomplete factorization.

The driving rationale for this study is that it is easier to incorporate incomplete factorization methods into direct solution software than it is to develop new incomplete factorizations. In that way, we use, as a starting point, the algorithms and the software components of PaStiX that is a parallel high performance supernodal direct solver for sparse symmetric positive definite systems and for sparse unsymmetric systems with a symmetric pattern. These different components are:
(1) the ordering phase, which computes a symmetric permutation of the initial matrix $A$ such that factorization process will exhibit as much concurrency as possible while incurring low fill-in. In this work, we use a tight coupling of the Nested Dissection and Approximate Minimum Degree algorithms;
(2) the block symbolic factorization phase, which determines the block data structure of the factorized matrix associated with the partition resulting from the ordering phase. This structure consists of several column-blocks, each of them containing a dense symmetric diagonal block and a set of dense rectangular off-diagonal blocks;
(3) the block repartitioning and scheduling phase, which refines the previous partition by splitting large supernodes in order to exploit concurrency within dense block computations in addition to the parallelism provided by the block elimination tree, both induced by the block computations in the supernodal solver. In this phase, we compute a mapping of the matrix blocks (that can be made by column block (1D) or block (2D)) and a static optimized scheduling of the computational and communication tasks according to BLAS and communication time models calibrated on the target machine. This static scheduling will drive the parallel factorization and the backward and forward substitutions.

Our approach consists in computing symbolically the block structure of the factors that would have been obtained with a complete factorization, and then deciding to drop off some blocks of this structure according to relevant criterions. This incomplete factorization induced by the new sparse pattern is then used in a preconditioned GMRES or Conjugate Gradient solver. Our main goal at this point is to achieve a significant diminution of the memory needed to store the incomplete factors while keeping enough fill-in to make the use of BLAS3 primitives profitable. Naturally, we must keep using the ordering phase and the mapping and scheduling phase (this phase is modified to suit the incomplete factorization block computation) to ensure an efficient parallel implementation of the block preconditioner computation and of the forward and backward substitutions in the iterations.

The first crucial point is then to find a good initial partition that can be used in the symbolic factorization (phase 2). It cannot be the initial supernodal partition computed by the ordering phase (phase 1) because it would be too costly to consider the diagonal blocks as dense blocks like in a complete factorization. Therefore, we use a refined partition of this supernodal partition that will then define the elementary blocks of the factors. We obtain the refined partition by splitting the column blocks issued from the supernodal partition according to a maximum blocksize parameter. This allows more possiblities to drop some blocks in the preconditioner. This blocksize parameter is an important key to find a good trade-off as described at the beginning of this paragraph. An important result from theorical analysis is that if we consider nested dissection ordering based on separations theorems and if we introduce some asymptotically refined partitions, one can demonstrate that the total number of blocks computed by the block symbolic factorization and the time to compute these blocks are still linear (this result is also true with the supernodal partition).

The second crucial point concerns the various criterions that are used to drop some blocks from the blockwise symbolic structure induced by the refined partition. The dropping criterions are, for example, the level-of-fill of a block in the block elimination process, the number of times a block contributes to a modification on another block or the number of times a block is modified by a contribution from another block. Remaining in the same theoretical context than in the previous paragraph, one can show that the number of operations needed to compute these criterions has a quasi-linear complexity.

A preliminary presentation of this work has been made at PREC'03, we will present here much more analysis of the dropping rules and experimemts on large 3D industrial problems.

Home page


2004-02-10