Masha Sosonkina
SCL Ames Laboratory
Iowa State University
Ames IA 50011, USA
email: masha@scl.ameslab.gov
ABSTRACT:
High-performance applications place great demands on the computation and communication resources of a distributed computing platform. If the availability of these resources changes dynamically, application performance may suffer. This is especially true for cluster environments, which are often heterogeneous and require tedious tuning for high-performance applications. It is therefore desirable to make an application aware of run-time changes in the system and to adapt it dynamically to the new conditions. Several options exist to accomplish this. We have chosen to use a helper tool, the NICAN middleware, which (1) performs timely and light-weight analysis of the desired system characteristics, (2) delivers the critical results of this analysis to the application, and (3) invokes, if necessary, application adaptations in reaction to changes in system conditions. Non-intrusive co-existence with the application and low overhead are the key features of the tool that enable seamless interaction with the application.
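The three roles listed above (analyze, deliver, invoke) can be illustrated with a minimal sketch. It is not NICAN's interface: the class `ResourceMonitor`, its methods, and the metric name `cpu_load` are hypothetical names chosen for illustration, and the polling is done synchronously here, whereas a non-intrusive helper would presumably run separately from the application's main computation.

```python
from typing import Callable, Dict, List, Tuple

Sampler = Callable[[], float]          # returns the current value of one metric
Handler = Callable[[str, float], None]  # application-side adaptation callback


class ResourceMonitor:
    """Hypothetical stand-in for a NICAN-like helper (not NICAN's API)."""

    def __init__(self) -> None:
        self._samplers: Dict[str, Sampler] = {}
        self._handlers: List[Tuple[str, float, Handler]] = []

    def add_sampler(self, name: str, sampler: Sampler) -> None:
        # (1) Register a light-weight measurement of one system characteristic.
        self._samplers[name] = sampler

    def on_exceed(self, name: str, threshold: float, handler: Handler) -> None:
        # (3) Register an adaptation to invoke when the metric exceeds threshold.
        self._handlers.append((name, threshold, handler))

    def poll_once(self) -> Dict[str, float]:
        # Take one reading of every metric and trigger matching adaptations.
        readings = {name: sample() for name, sample in self._samplers.items()}
        for name, threshold, handler in self._handlers:
            value = readings.get(name)
            if value is not None and value > threshold:
                handler(name, value)
        # (2) Deliver the readings to the caller (the application).
        return readings
```

An application would register a sampler and a handler once, then let the helper call `poll_once` periodically; the handler is where the application decides how to adapt.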
In this talk, we first outline the design of the NICAN middleware and then describe a packet probing technique in NICAN that detects contention on the cluster nodes to which the application is mapped. The technique is light-weight and can be tuned a priori to a given network type, so that it can be used at application run time. Finally, after showing how to integrate this technique with a sample scientific application, the parallel Algebraic Recursive Multilevel Solver (pARMS), we emphasize the run-time adaptations that may be triggered in pARMS based on the current conditions. Numerical experiments, which target faster iterative convergence of pARMS, are performed for various computing system configurations and conditions.
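As a rough illustration of probing with a priori tuning, the sketch below calibrates a baseline round-trip time on an idle network and later flags contention when the probes slow down markedly. The abstract does not specify how NICAN's probing works, so everything here is an assumption: the `probe` callable (which would time one small packet exchange with a peer node), the use of the median, and the default `factor` of 2.0 are illustrative choices only.

```python
import statistics
from typing import Callable

Probe = Callable[[], float]  # assumed: returns one round-trip time in seconds


def median_rtt(probe: Probe, count: int) -> float:
    """Send `count` probes and return the median round-trip time."""
    if count < 1:
        raise ValueError("count must be at least 1")
    return statistics.median(probe() for _ in range(count))


def calibrate(probe: Probe, count: int = 20) -> float:
    """A priori tuning step: measure the baseline on an uncontended network."""
    return median_rtt(probe, count)


def contention_detected(probe: Probe, baseline: float,
                        count: int = 5, factor: float = 2.0) -> bool:
    """Run-time check: a few probes, compared against the tuned baseline."""
    return median_rtt(probe, count) > factor * baseline
```

Keeping the run-time check to a handful of probes is what keeps it light-weight; the more expensive calibration happens once, before the application starts.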