MPP Performance Optimization

2/2/98


Table of Contents

MPP Performance Optimization

Outline

Performance

Performance Examples

Issues in Performance -- Problem Size and Precision

Issues in Performance -- Execution Time

Issues in Performance -- Execution Time

Parallel Performance Issues

Performance Metrics

Performance Metrics

Performance Metrics

Performance Metrics

Performance Metrics

Performance Metrics

Asymptotic Analysis

Amdahl’s Law

Amdahl’s Law

Efficiency

What is Optimization?

Types of Optimization

Steps of Optimization

Performance Strategies

Performance Strategies

Considerations when Optimizing

Locality

Memory Hierarchy

SP2 Access Times

Cache Performance

Types of Cache

Memory Access

Memory Access

Serial Optimizations

Array Optimization

Array Allocation

Array Padding

Stride Minimization

Stride Minimization

Stride Minimization

Array Initialization

Array Initialization

Loop Fusion

Loop Fusion

Loop Fusion

Loop Interchange

Loop Interchange

Loop Interchange

Floating IF’s

Floating IF’s

Floating IF’s

Loop Defactorization

Loop Defactorization

Loop Defactorization

Loop Defactorization

Loop Peeling

Loop Peeling

Loop Peeling

Loop Collapse

Loop Collapse

Loop Collapse

Loop Unrolling

Loop Unrolling

Loop Unrolling

Loop Unrolling

Loop Unrolling and Sum Reductions

Loop Unrolling and Sum Reductions

Loop Unrolling and Sum Reductions

Outer Loop Unrolling

Outer Loop Unrolling

Outer Loop Unrolling

Outer Loop Unrolling

Loop structure

Strength Reduction

Strength Reduction -- Horner’s Rule

Strength Reduction -- Horner’s Rule

Strength Reduction -- Horner’s Rule

Strength Reduction -- Integer Division by a Power of 2

Strength Reduction -- Integer Division by a Power of 2

Strength Reduction -- Integer Division by a Power of 2

Strength Reduction -- Factorization

Strength Reduction -- Factorization

Strength Reduction -- Factorization

Subexpression Elimination -- Parentheses

Subexpression Elimination -- Parentheses

Subexpression Elimination -- Parentheses

Subexpression Elimination -- Type Considerations

Subexpression Elimination -- Type Considerations

I/O Considerations

I/O Considerations

Optimized Arithmetic Libraries

Optimized Arithmetic Libraries

BLAS

Level 1, 2 and 3 BLAS

BLAS for Performance

BLAS Performance

BLACS -- Introduction

BLACS -- Basics

BLACS -- Basics

PBLAS -- Introduction

Scope of the PBLAS

PBLAS

PBLAS -- Syntax

PBLAS -- Storage Conventions

PBLAS -- Examples

PBLAS -- Examples

Features of PBLAS V2 ALPHA

Features of PBLAS V2 ALPHA

LAPACK

LAPACK

LAPACK -- Release 3.0

ScaLAPACK Structure

LAPACK - Goals

ScaLAPACK Implementation

ScaLAPACK Functionality

ScaLAPACK Functionality

Parallelism in ScaLAPACK

Narrow Band and Tridiagonal Matrices

ScaLAPACK Documentation

ESSL

Optimized Arithmetic Libraries

Optimized Arithmetic Libraries

Parallel Optimization

Choosing a Data Distribution

Possible Data Layouts

Two-dimensional Block-Cyclic Distribution

Load Balancing

MPP Optimization

Parallel Performance

Message Passing

Message Passing

Message Passing

Message Passing

Message Passing

Communication Issues

Communication Issues

Communication Issues

Message Passing

MPI Bandwidth

MPI Send Latency

Message Passing

PVM Message Passing

PVM Optimizations

PVM Optimizations

PVM Optimizations

MPI Message Passing

MPI Message Passing

MPI Message Passing

MPI Optimizations

MPI Data Types

MPI Collective Communication

MPI Collective Communication

Message Passing Optimizations

Message Passing Optimization -- Nearest Neighbor Example 1

Message Passing Optimization -- Nearest Neighbor Example 2

Message Passing Optimization -- Nearest Neighbor Example

MPI Message Passing

Automatic Parallelization

Automatic Parallelization

Data Parallelism

Data Parallelism on the SGI’s

Data Parallelism on the SGI’s

Data Parallelism on the SGI’s

Task Parallelism

Task Parallelism

Limits on Parallel Speedup

Parallel Overhead

Parallel Overhead

Reducing Parallel Overhead

Reducing Parallel Overhead

Improving Load Balance

Improving Load Balance

Improving Load Balance

Improving Load Balance

SGI Origin 2000

SGI Origin 2000

SP2

SP2

T3E

T3E

T3E

T3E Cache Bypass

T3E Streams

T3E Streams

T3E E-registers

T3E E-registers

T3E/SGI Software Pipelining

Subroutine Inlining

O2K Flags and Libraries

SP2 Flags and Libraries

T3E Flags and Libraries

Timers

Timers

Timers

Timers

Timers

Timers

prof

prof

gprof

gprof

prof/gprof

Parallel Performance Tool Capabilities

Speedshop

Speedshop (cont)

Using Speedshop

Pcsamp Example

Example of Usertime

Speedshop

Speedshop (cont)

Using Speedshop

Ideal experiment

Pcsamp Example

Example of Usertime

tprof for the SP2

tprof for the SP2

PAT for the T3E

PAT for the T3E

Apprentice for the T3E

Automated Instrumentation and Monitoring System (AIMS)

AIMS Components

xinstrument

xinstrument (cont.)

Visualizing Trace Files with VK

Controlling Scale and Speed of Playback

tally output - tally.summary

tally.summary (cont.)

tally output - ncpu.summary

MPE Logging/nupshot

MPE Logging Library (cont.)

nupshot

Pablo TraceLibrary

I/O Extension to TraceLibrary

I/O Extension (cont.)

I/O Extension (cont.)

MPI TraceLibrary Extension

MPI Extension (cont.)

Pablo Trace File Analysis

SDDFStatistics

SDDFStatistics Usage

I/O Analysis Programs

I/O Analysis Programs (cont.)

I/O Analysis Programs (cont.)

Pablo Analysis GUI

Analysis GUI (cont.)

Paradyn

Paradyn Goals

Paradyn Approach

Paradyn Components

Performance Consultant

Performance Consultant (cont.)

SvPablo

SvPablo

SvPablo Project

Line Metrics

VAMPIR

VAMPIR

VAMPIR Displays

References

Additional Documentation

References

Author: Philip J. Mucci

Email: mucci@cs.utk.edu

Home Page: http://www.cs.utk.edu/~mucci


Author: Kevin S. London

Email: london@cs.utk.edu

Home Page: http://www.cs.utk.edu/~london