Table of Contents

MPP Performance Optimization
Outline
Performance
Performance Examples
Issues in Performance: Problem Size and Precision
Issues in Performance: Execution time
Parallel Performance Issues
Performance Metrics
Asymptotic Analysis
Amdahl’s Law
Efficiency
What is Optimization?
Types of Optimization
Steps of Optimization
Performance Strategies
Considerations when Optimizing
Locality
Memory Hierarchy
SP2 Access Times
Cache Performance
Types of Cache
Memory Access
Serial Optimizations
Array Optimization
Array Allocation
Array Padding
Stride Minimization
Array Initialization
Loop Fusion
Loop Interchange
Floating IF’s
Loop Defactorization
Loop Peeling
Loop Collapse
Loop Unrolling
Loop Unrolling and Sum Reductions
Outer Loop Unrolling
Loop structure
Strength Reduction
Strength Reduction: Horner’s Rule
Strength Reduction: Integer Division by a Power of 2
Strength Reduction: Factorization
Subexpression Elimination: Parenthesis
Subexpression Elimination: Type Considerations
I/O Considerations
Optimized Arithmetic Libraries
BLAS
Level 1, 2 and 3 BLAS
BLAS for Performance
BLAS Performance
BLACS -- Introduction
BLACS -- Basics
PBLAS -- Introduction
Scope of the PBLAS
PBLAS
PBLAS -- Syntax
PBLAS -- Storage Conventions
PBLAS -- Examples
Features of PBLAS V2 ALPHA
LAPACK
LAPACK -- Release 3.0
ScaLAPACK Structure
LAPACK -- Goals
ScaLAPACK Implementation
ScaLAPACK Functionality
Parallelism in ScaLAPACK
Narrow Band and Tridiagonal Matrices
ScaLAPACK Documentation
ESSL
Optimized Arithmetic Libraries
Parallel Optimization
Choosing a Data Distribution
Possible Data Layouts
Two-dimensional Block-Cyclic Distribution
Load Balancing
MPP Optimization
Parallel Performance
Message Passing
Communication Issues
Message Passing
MPI Bandwidth
MPI Send Latency
Message Passing
PVM Message Passing
PVM Optimizations
MPI Message Passing
MPI Optimizations
MPI Data Types
MPI Collective Communication
Message Passing Optimizations
Message Passing Optimization: Nearest Neighbor Example 1
Message Passing Optimization: Nearest Neighbor Example 2
Message Passing Optimization: Nearest Neighbor Example
MPI Message Passing
Automatic Parallelization
Data Parallelism
Data Parallelism on the SGI’s
Task Parallelism
Limits on Parallel Speedup
Parallel Overhead
Reducing Parallel Overhead
Improving Load Balance
SGI Origin 2000
SP2
T3E
T3E Cache Bypass
T3E Streams
T3E E-registers
T3E/SGI Software Pipelining
Subroutine Inlining
O2K Flags and Libraries
SP2 Flags and Libraries
T3E Flags and Libraries
Timers
prof
gprof
prof/gprof
Parallel Performance Tool Capabilities
Speedshop
Speedshop (cont.)
Using Speedshop
Pcsamp Example
Example of Usertime
Speedshop
Speedshop (cont.)
Using Speedshop
Ideal experiment
Pcsamp Example
Example of Usertime
tprof for the SP2
PAT for the T3E
Apprentice for the T3E
Automated Instrumentation and Monitoring System (AIMS)
AIMS Components
xinstrument
xinstrument (cont.)
Visualizing Trace Files with VK
Controlling Scale and Speed of Playback
tally output - tally.summary
tally.summary (cont.)
tally output - ncpu.summary
MPE Logging/nupshot
MPE Logging Library (cont.)
nupshot
Pablo TraceLibrary
I/O Extension to TraceLibrary
I/O Extension (cont.)
MPI TraceLibrary Extension
MPI Extension (cont.)
Pablo Trace File Analysis
SDDFStatistics
SDDFStatistics Usage
I/O Analysis Programs
I/O Analysis Programs (cont.)
Pablo Analysis GUI
Analysis GUI (cont.)
Paradyn
Paradyn Goals
Paradyn Approach
Paradyn Components
Performance Consultant
Performance Consultant (cont.)
SvPablo
SvPablo Project
Line Metrics
VAMPIR
VAMPIR Displays
References
Additional Documentation
References

Author: Philip J. Mucci
Email: mucci@cs.utk.edu
Home Page: http://www.cs.utk.edu/~mucci

Author: Kevin S. London
Email: london@cs.utk.edu
Home Page: http://www.cs.utk.edu/~london