What is the syntax and usage model for the dual-processor BLAS, and why does it work on some Linux distributions and not others?

The ASCI Red supercomputer at SNL has an unsupported, ancient feature developed by Dr. Stephen Wheat, formerly at SNL, for SUNMOS (TM) (Sandia and University of New Mexico Operating System), an older lightweight kernel that predates ASCI Red's "Cougar" and was ported to the Intel Paragon supercomputer. For some reason, the Intel ASCI Red supercomputer kept this feature for backward compatibility. It is a simple method of obtaining task parallelism, and can be implemented using pthreads.

The dual-processor BLAS on the ASCI Red Distribution webpage still uses "cop" to obtain dual-processor performance. All dual-processor functionality is found in one binary, cop_dual.o, and it can be implemented in just a few lines of C code. I've yet to find a few lines, however, that work in all OSs with all versions of libpthread.a. "COP" does nothing that parallelizing compilers do not already do, but it does give the user explicit management over their shared-memory parallelism.

The syntax is:

    void cop( routinename_addr, holdflag, input_structure)
    void (*routinename_addr)(char *);
    volatile int *holdflag;
    char *input_structure;
    {}

cop_init(1) starts a single thread up into a sleeping state, waiting for the execution of cop(). cop_init(1) is not a user routine on ASCI Red's Cougar, since it is executed automatically at start-up, but it is necessary for the Linux version of the library.

Once cop() is called (only on the main thread), the main thread returns immediately to do other work. The sleeping (slave) thread initiated by cop_init(1) wakes up and executes the following:

    routinename_addr( input_structure );
    ++(*holdflag);

The slave thread then returns to a sleep state, waiting for the next cop call. While the slave thread is busy calling the routine passed, the main processor can be off doing other tasks in parallel.
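As a rough illustration of the semantics just described, the sketch below uses a pthread mutex and two condition variables, so that the second thread really sleeps between calls rather than spinning. It is a minimal sketch and not the code in cop_dual.o: the names sketch_cop_init, sketch_cop, sketch_cop_wait, and sketch_demo are hypothetical, only one job can be in flight at a time, and it assumes a POSIX threads library is available.

```c
#include <stddef.h>
#include <pthread.h>

/* All names below are hypothetical; this is not the cop_dual.o source. */
typedef void (*cop_fn)(char *);

static pthread_mutex_t cop_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cop_wake = PTHREAD_COND_INITIALIZER; /* worker sleeps here  */
static pthread_cond_t  cop_done = PTHREAD_COND_INITIALIZER; /* callers sleep here  */
static cop_fn          cop_routine = NULL;  /* pending job, NULL when idle */
static volatile int   *cop_flag    = NULL;
static char           *cop_params  = NULL;
static int             cop_started = 0;

/* Worker: sleep until a job is posted, run it, increment the flag, repeat. */
static void *cop_worker(void *unused)
{
    cop_fn fn;
    volatile int *flag;
    char *params;

    (void)unused;
    for (;;) {
        pthread_mutex_lock(&cop_lock);
        while (cop_routine == NULL)
            pthread_cond_wait(&cop_wake, &cop_lock);
        fn = cop_routine;
        flag = cop_flag;
        params = cop_params;
        pthread_mutex_unlock(&cop_lock);

        fn(params);                       /* routinename_addr(input_structure) */

        pthread_mutex_lock(&cop_lock);
        ++(*flag);                        /* ++(*holdflag) */
        cop_routine = NULL;               /* worker is idle again */
        pthread_cond_broadcast(&cop_done);
        pthread_mutex_unlock(&cop_lock);
    }
    return NULL;                          /* not reached */
}

/* Start the worker once; call from the main thread. Returns 0 on success. */
int sketch_cop_init(void)
{
    pthread_t tid;
    int status;

    if (cop_started)
        return 0;
    status = pthread_create(&tid, NULL, cop_worker, NULL);
    if (status != 0)
        return status;                    /* pthread_create returns an error number */
    pthread_detach(tid);
    cop_started = 1;
    return 0;
}

/* Post a job and return at once; waits only if a previous job is still running. */
void sketch_cop(cop_fn fn, volatile int *flag, char *params)
{
    pthread_mutex_lock(&cop_lock);
    while (cop_routine != NULL)
        pthread_cond_wait(&cop_done, &cop_lock);
    cop_routine = fn;
    cop_flag = flag;
    cop_params = params;
    pthread_cond_signal(&cop_wake);
    pthread_mutex_unlock(&cop_lock);
}

/* Sleep until the worker has incremented *flag. */
void sketch_cop_wait(volatile int *flag)
{
    pthread_mutex_lock(&cop_lock);
    while (*flag == 0)
        pthread_cond_wait(&cop_done, &cop_lock);
    pthread_mutex_unlock(&cop_lock);
}

/* Demo: add one to every element of a 12-element array, split over two threads. */
struct demo_params { int n; int *x; };

static void demo_add_odd(char *arg)
{
    struct demo_params *d = (struct demo_params *)arg;
    int i;
    for (i = 1; i < d->n; i += 2)
        d->x[i] += 1;
}

int sketch_demo(void)
{
    int x[12] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11};
    struct demo_params d;
    volatile int flag = 0;
    int i, sum = 0;

    d.n = 12;
    d.x = x;
    if (sketch_cop_init() != 0)
        return -1;
    sketch_cop(demo_add_odd, &flag, (char *)&d);
    for (i = 0; i < 12; i += 2)           /* main thread takes the even indices */
        x[i] += 1;
    sketch_cop_wait(&flag);
    for (i = 0; i < 12; i++)
        sum += x[i];
    return sum;                           /* 66 from the initial values plus 12 increments */
}
```

Waiting on a condition variable under the same mutex that guards the flag avoids relying on volatile alone for visibility between threads. The cost is that callers use sketch_cop_wait() instead of spinning on the flag directly. Build with the compiler's threads option (for example, -pthread with gcc).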
The main thread can block until the volatile int flag has been incremented. Typically, input_structure is a pointer to a structure of several variables, because most routines require several parameters. This, of course, is irrelevant to the implementation. The difference between most of the single-processor and dual-processor BLAS looks something like:

    flag = 0;
    #ifdef SINGLE
       routine_name(&params);
       flag = 1;
    #else
       cop( &routine_name, &flag, &params );
    #endif
    /* main thread does some work */
    while ( flag == 0 );

Here is a sample code that uses COP to add one to every element in a 12-element array (obviously good for illustration only):

    #include <stdio.h>
    #define NUMPROCS 2

    /* Provided by cop_dual.o */
    void cop(void (*routine)(char *), volatile int *hold, char *params);
    void cop_init_(int *number);

    volatile int hold0 = 0;
    struct routinetype { int N; };
    int x[12] = {0,1,2,3,4,5,6,7,8,9,10,11};

    /* Runs on the second thread: increments the odd-indexed elements. */
    void routine(char *arg)
    {
        struct routinetype *params = (struct routinetype *)arg;
        int i, j;
        j = params->N;
        for (i = 1; i < j; i += NUMPROCS) x[i] += 1;
    }

    int main(void)
    {
        static struct routinetype params;
        static int i;
        int one = 1;

        params.N = 12;
        hold0 = 0;
        cop_init_(&one);   /* cop_init_ takes a pointer to the thread count */
        cop(routine, &hold0, (char *)&params);
        /* Meanwhile the main thread increments the even-indexed elements. */
        for (i = 0; i < 12; i += NUMPROCS) x[i] += 1;
        while ( hold0 == 0 ) ;   /* wait for the second thread to finish */
        for (i = 0; i < 12; i++) printf("x[%2d]=%2d\n", i, x[i]);
        return 0;
    }

The implementation I have for cop_dual.c looks something like:

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include "pthread.h"

    volatile int gmhstate[2] = {0,0};
    int _my_proc_mode = 2; /* This is used to satisfy external references, but it is not used by this code */

    void cop_init_(int *number);

    /* Called by the main thread to post a job (state 0 -> 1) and by the
       slave thread to run the posted job (state 1 -> 0). */
    void cop(routine,hold,params)
    void (*routine)(char *);
    volatile int *hold;
    char *params;
    {
        static void (*routine1)(char *);
        static volatile int *hold1;
        static char *params1;
        static int i_one = 1;
        static int times = 0;

        if ( gmhstate[0] == 0 ) {
            if ( times == 0 ) {
                /* pthread_init(); pthread_yield(); */
                cop_init_( &i_one );
                times = 1;
            }
            routine1 = routine;
            hold1 = hold;
            params1 = params;
            gmhstate[0] = 1;
        } else {
            (*routine1)(params1);
            *hold1 = 1;
            gmhstate[0] = 0;
        }
    }

    /* SPIN loop for COP interface */
    /* There are 3 states: 0= Spin, 1= Run with C, 3= Stop */
    void *gmhslave(arg)
    void *arg;
    {
        int id = (int)(intptr_t)arg;
        /* Placeholders only: cop() ignores its arguments in the run branch. */
        void (*routine)(char *) = 0;
        volatile int *hold = 0;
        char *params = 0;

        while ( gmhstate[id] != 3 ) {
            if ( (id==0) && (gmhstate[0]==1) ) cop(routine,hold,params);
            if ( (id==1) && (gmhstate[1]==1) ) cop(routine,hold,params);
        }
        return NULL;
    }

    void cop_init_(number)
    int *number;
    {
        static int started = 0;
        int i, status;
        pthread_t *thread;

        if ( started ) return;   /* create the slave threads only once */
        started = 1;
        thread = (pthread_t *) malloc((*number)*sizeof(pthread_t));
        if ( thread == NULL ) {
            fprintf(stderr,"malloc error\n");
            return;
        }
        for (i = 0; i < *number; i++) {
            /* NULL requests the default thread attributes */
            status = pthread_create(&thread[i],NULL,gmhslave,(void *)(intptr_t)i);
            /* pthread_create returns a nonzero error number on failure */
            if ( status != 0 ) fprintf(stderr,"pthread_create error\n");
        }
    }

If anybody knows how to implement this functionality in a way that is completely compatible across multiple OSs, please let me know, and I'll distribute your solution.

- Greg Henry - greg.henry@intel.com