PARA'04 State-of-the-Art
in Scientific Computing
June 20-23, 2004 (Home page)

Updated: 6 February 2004

Parallelization of GSL: Performance of Case Studies

J. Aliaga (1), F. Almeida (2), J.M. Badia (1), S. Barrachina (1), V. Blanco (2), M. Castillo (1), U. Dorta (2), R. Mayo (1), E.S. Quintana (1), G. Quintana (1), C. Rodriguez (2), and F. de Sande (2)
Affiliations:
(1) Depto. de Ingenieria y Ciencia de Computadores,
Univ. Jaume I,
12.071-Castellon, Spain;
{aliaga,badia,barrachi,castillo,mayo,quintana,gquintan}@icc.uji.es.
(2) Depto. de Estadistica, Investigacion Operativa y Computacion,
Univ. de La Laguna,
38.271-La Laguna, Spain;
{falmeida,vblanco,casiano,fsande}@ull.es

The GNU Scientific Library (GSL) is a collection of hundreds of routines for numerical scientific computations coded in ANSI C, which include routines for complex arithmetic, matrices and vectors, linear algebra, integration, statistics, and optimization, among others.

One of the reasons GSL was never parallelized seems to be the lack of a globally accepted standard for developing parallel applications. We believe that with the introduction of OpenMP and MPI the situation has changed substantially. OpenMP has become a standard de facto for programming shared-memory multiprocessors, where the parallelism is exploited using multiple threads, while MPI is nowadays accepted as the standard interface for developing parallel applications using the message-passing programming model.

In this paper we will employ two classical numerical computations to explore the challenges and difficulties involved in the parallelization of GSL, both on multithreading and message-passing environments. In particular, our first case study will pursue the parallelization of the unstructured sparse matrix-vector product. Due to the irregular structure of the matrix, special care must be taken in this operation in order to balance the computational load and obtain an acceptable speed-up of the parallel algorithm. The second operation considered here will be the fast Fourier transform (FFT). While the parallelization of this operation is simple when both the data size and the number of processes are a power of two, in the general case the communication costs of the parallel algorithm can be largely penalized.

We expect to report performance results for shared-memory parallel architectures, multicomputers, and clusters of networked personal computers. From the experimental results of these two case studies we expect to acquire the necessary information to decide on an appropriate approach and architecture to parallelize certain parts of GSL.

Home page


2004-02-06