Galyuk, Yu.P.2, Memnonov, V.P.1, and Zolotarev, V.I.3
1 St.Petersburg State University, Dep. Math. Mech., Universitetski
pr.28, St.Petersburg, Star.Peterhoff, 198504, Russia
2 Petrodvorets Telecommunication Center, Ulyanovskaya st.1,
St.Petersburg, Star.Peterhoff, 198504, Russia
email: galyuck@paloma.spbu.ru
3 Petrodvorets Telecommunication Center, Ulyanovskaya st.1,
St.Petersburg, Star.Peterhoff, 198504, Russia
email: viz@ptc.spbu.ru
Using distributed Monte Carlo simulation on several parallel clusters, we study gas flows in very narrow channels in the transitional regime. Although the mean gas velocities in this problem are small compared with the thermal molecular speeds, a significant reduction of the statistical scatter in the simulations was achieved by enlarging the statistical samples through the use of a multicluster system.
At the same time, a specific problem that always arises in such a setting, namely the trouble-free operation of remote hardware belonging to someone else, was resolved with the help of two fault-tolerant algorithms that we developed for distributed computing under MPI. The paper describes the error monitoring and fault management systems used in these algorithms and gives some evaluation of their efficiency and cost. They are shown to enhance the reliability of Monte Carlo simulations: a computer breakdown in any cluster only partly diminishes the total statistical sample. Because the statistical error of these methods scales only as the inverse square root of the sample size, in many cases it is preferable to sever all dependences between the failed cluster and the others at the proper moment and to save the solution on the remaining clusters. This is one of the duties of the fault management system mentioned above. Thus the whole computation is not interrupted and stopped; only the total statistical sample is diminished by the portion initially intended for the failed cluster. Even these losses can be greatly reduced if the user application provides a checkpointing procedure. These can also easily be built into other statistical programs [1].
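The cost of dropping one cluster can be illustrated with a minimal Python sketch. It is not the MPI implementation described in the paper: the function names (simulate_cluster, merge, error_growth), the Gaussian stand-in for the sampled quantity, and the choice of four equal clusters are all assumptions made for illustration only.

```python
import math
import random


def simulate_cluster(n_samples, seed):
    """Hypothetical stand-in for one cluster's run.

    Returns the partial sums (count, sum, sum of squares) of a sampled
    quantity whose mean is small compared with its thermal spread.
    """
    rng = random.Random(seed)
    total = 0.0
    total_sq = 0.0
    for _ in range(n_samples):
        x = rng.gauss(0.01, 1.0)  # assumed toy distribution, not real DSMC data
        total += x
        total_sq += x * x
    return n_samples, total, total_sq


def merge(partials):
    """Combine partial sums from the clusters that reported back.

    Returns (sample size, mean, standard error of the mean).
    """
    n = sum(p[0] for p in partials)
    s = sum(p[1] for p in partials)
    sq = sum(p[2] for p in partials)
    mean = s / n
    variance = (sq - n * mean * mean) / (n - 1)
    return n, mean, math.sqrt(variance / n)


def error_growth(n_total, n_lost):
    """Factor by which the statistical error grows when n_lost of
    n_total samples are lost, assuming error ~ 1/sqrt(sample size)."""
    return math.sqrt(n_total / (n_total - n_lost))


# Four equal clusters (assumed); cluster 3 fails and is disconnected.
partials = [simulate_cluster(5000, seed) for seed in range(4)]
survivors = partials[:3]

n_all, mean_all, err_all = merge(partials)
n_ok, mean_ok, err_ok = merge(survivors)
print("all clusters:      n =", n_all, " std. error =", err_all)
print("after one failure: n =", n_ok, " std. error =", err_ok)
print("predicted error growth:", error_growth(4, 1))
```

Under these assumptions, losing one of four equal clusters increases the statistical error by a factor of sqrt(4/3), roughly 15 percent, while the result from the surviving clusters remains usable.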
Of course, this is an end-user reliability solution for a specific but rather large class of applications. At the same time, general projects such as Fault Tolerant MPI of the University of Tennessee and other general solutions [2] do not yet seem quite ready for general practice.
Development of a dynamic load balancing technique for clusters, similar to the one we already used for processors in [3], is also in progress.
References:
[1]S.Denisikhin, V.Memnonov, and S.Zhuravleva. Parallel implementation
of the DSMC Method coupled with a Continuum Solution: Simulation of a
Lubrication Problem in Magnetic Disk Storage. In P.M.A. Sloot, D.
Abramson, A.V. Bogdanov, J.J. Dongarra, A.Y. Zomaya, and Yu.E. Gorbachev
Eds. ICCS 2003, LNCS 2658, pp.565-574, 2003, Springer-Verlag Berlin
Heidelberg 2003.
[2]S. Louca, N. Neophytou, A. Lachanas, and P. Evripidou. MPI-ft:
Portable fault tolerance scheme for MPI. In Parallel Processing Letters,
2000, Vol.10, No. 4, 371-382.
[3]Y.P. Galyuk, V.P.Memnonov, S.E.Zhuravleva and V.I.Zolotarev: Grid
Technology with Dynamic Load Balancing for Monte Carlo Simulation, in:
Applied Parallel Computing, Lecture Notes in Computer Science, Eds. J.
Fagerholm, J. Haataja, J. Jarvinen, M. Lyly, P. Raback, V. Savolainen,
Springer-Verlag, Berlin Heidelberg New York (2002), 515-520.