PARA'04 State-of-the-Art
in Scientific Computing
June 20-23, 2004

Updated: 6 February 2004

Interaction of Cache, Communication, and Load Increase on SMP Clusters for Parallel Adaptive FEM

Judith Hippold and Gudula Ruenger
Chemnitz University of Technology
Department of Computer Science
09107 Chemnitz, Germany
email: juh,ruenger@informatik.tu-chemnitz.de

Many problems in the natural sciences and engineering can be modeled by partial differential equations (PDEs) and solved with the popular finite element method (FEM). The method rests on a discretization of the physical domain into a mesh of finite elements and on the approximation of the unknown solution function by a set of shape functions on those elements. Although adaptive mesh refinement was designed to reduce computational effort and memory needs, the simulation of real-life problems with finite elements still requires parallel computers, especially for three-dimensional hexahedral meshes. We have implemented a parallel realization for distributed memory which uses MPI and distributes the finite elements over the address spaces of the different processors.

Clusters of SMPs (symmetric multiprocessors) are gaining importance in high-performance computing because they provide large computational power at a reasonable price. Tests with the parallel adaptive FEM implementation on this two-level architecture expose a variety of performance-determining dependencies involving cache exploitation or the number of MPI processes per node. The allocation of MPI processes to the physical processors can reduce communication overhead because SMP-internal communication might be realized more efficiently than communication between cluster nodes. Assigning fewer MPI processes than available processors to a cluster node does not fully use the available resources but delays cache effects, since the number of finite elements to process per cluster node increases with the number of MPI processes running on that node. Besides these factors, which are caused by hardware characteristics or process design, there is strongly irregular behavior caused by the adaptivity of mesh refinement. The number of finite elements increases dynamically and irregularly with each adaptive refinement step, which may cause a huge load increase within a single program run as well as a large load imbalance in later iteration steps. Tests have shown, however, that a run with load imbalance caused by adaptive refinement can still achieve very good speedups if the cache size and the number of processors have certain characteristics. Altogether, the most efficient implementation strategy is difficult to choose.
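The effect of process placement on communication can be illustrated with a small sketch. It is not taken from the implementation described here. It assumes a block placement of ranks onto nodes and a hypothetical ring exchange between neighboring ranks, and the names `node_of` and `classify_messages` are chosen only for illustration. The sketch counts how many messages stay inside an SMP node and how many cross the network.

```python
def node_of(rank, procs_per_node):
    """Assumed block placement: the first p ranks on node 0, the next p on node 1, ..."""
    return rank // procs_per_node


def classify_messages(pairs, procs_per_node):
    """Count rank-to-rank messages that stay inside a node vs. cross the network."""
    counts = {"intra_node": 0, "inter_node": 0}
    for src, dst in pairs:
        same_node = node_of(src, procs_per_node) == node_of(dst, procs_per_node)
        counts["intra_node" if same_node else "inter_node"] += 1
    return counts


# Hypothetical communication pattern: each of 8 ranks sends to its right neighbor.
ring = [(r, (r + 1) % 8) for r in range(8)]

one_per_node = classify_messages(ring, procs_per_node=1)
two_per_node = classify_messages(ring, procs_per_node=2)
print(one_per_node)
print(two_per_node)
```

In this toy pattern, placing two processes per node keeps half of the messages inside a node. Whether that pays off in practice depends on how efficiently the MPI library realizes SMP-internal communication and on the cache effects discussed above.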

This paper examines the interaction of cache size, communication, dynamic load increase, and dynamic load imbalance in adaptive FEM. In particular, we focus on the following dimensions of dependencies: the occurrence of cache effects due to the load increase caused by adaptive mesh refinement (AMR), the occurrence of cache effects due to multiple MPI processes on a single SMP, and the influence of SMP-internal communication vs. network communication on program execution time. All effects are closely related and influence each other. In addition, the strong input dependence of adaptive FEM may result in different program behavior for each application problem. The goal of the paper is to quantify the interactions of dynamic program behavior and performance due to cache effects and communication, and to find a first approach to a load redistribution strategy that takes the hardware platform into account in order to find a trade-off between redistribution, communication, and cache exploitation. As hardware platforms we use a Xeon SMP cluster with 16x2 processors, a SunBlade1000 SMP cluster with 4x2 processors, and an IBM Regatta system.
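As a rough illustration of dynamic load increase and load imbalance, the following sketch models one adaptive refinement step. It is a simplified model under stated assumptions, not the redistribution strategy of the paper: each refined element is assumed to be replaced by 8 children (regular refinement of a hexahedron), the per-process element counts and refinement fractions are made up, and imbalance is measured as the ratio of the largest load to the mean load.

```python
def imbalance(loads):
    """Ratio of the largest per-process load to the mean load (1.0 means balanced)."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean


def refine(loads, refined_fractions, children=8):
    """Replace the given fraction of each process's elements by `children` new elements.

    `children=8` is an assumption (regular refinement of a hexahedron).
    """
    new_loads = []
    for n, fraction in zip(loads, refined_fractions):
        refined = round(n * fraction)
        new_loads.append(n - refined + children * refined)
    return new_loads


# Hypothetical start: 4 processes with 1000 elements each, perfectly balanced.
before = [1000, 1000, 1000, 1000]
# Hypothetical refinement: only process 0 refines, and it refines half its elements.
after = refine(before, [0.5, 0.0, 0.0, 0.0])

print(sum(before), imbalance(before))
print(sum(after), imbalance(after))
```

A single local refinement step in this model increases the total load and leaves one process with several times the work of the others. A hardware-aware redistribution strategy would have to weigh the cost of moving elements against such an imbalance and against the cache behavior on each node.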
