Xing Cai
Simula Research Laboratory
and Department of Informatics
University of Oslo
Norway
Abstract:
The focus of this paper is on the performance of a particular class of parallel applications that numerically solve partial differential equations (PDEs) on unstructured meshes. More specifically, we only consider such parallel PDE solvers that are based on the mathematical principle of additive Schwarz iterations. These domain decomposition algorithms are inherently parallel, rapid in their numerical convergence, and have a simple algorithmic framework allowing extensive re-use of existing sequential PDE software components. However, overlapping subdomains are needed in this class of domain decomposition algorithms for ensuring the fast numerical convergence. Consequently, the overlapping regions contribute to increased overhead, considering the "extra" number of degrees of freedom per subdomain. To restrict this negative effect, we categorize the degrees of freedom per subdomain into three groups. The first group participates in all types of local computations, while the second group avoids only the local computation that is involved in computing a global inner product between two vectors, whereas the third group avoids local computations that are associated with both computing a global inner product and doing a global matrix vector multiply. To achieve this grouping of the subdomain degrees of freedom, we need a partial re-partitioning of the global degrees of freedom on an unstructured mesh. This must be done following certain mathematical constraint and load balancing requirement. Moreover, the degrees of freedom per subdomain must also be sorted according to this grouping, so that different software components can make best use of memory contiguity.
As another source of overall performance enhancement, we also look at the topic of improving the performance of sequential PDE software components. Relevant issues in this respect include making use of the symmetry of the mathematical problem, avoiding unnecessary function calls, and improving cache re-use in the most computationally intensive component in a PDE application, i.e., the matrix vector multiply. We will demonstrate the combined effect of overhead reduction and sequential software tuning by the example of a large-scale simulator of electrical activities in the heart.