Characterizing Load and Communication Imbalance in Parallel Applications
Autori
Viac o knihe
The amount of parallelism in modern supercomputers currently grows from generation to generation, and is expected to reach orders of millions of processor cores in a single system in the near future. Further application performance improvements therefore depend to a large extend on software-managed parallelism: in particular, the software must organize data exchange between processing elements efficiently and optimally distribute the workload between them. Performance analysis tools help developers of parallel applications to evaluate and optimize the parallel efficiency of their programs by pinpointing specific performance bottlenecks. However, existing tools are often incapable of identifying complex imbalance patterns and determining their performance impact reliably. This dissertation presents two novel methods to automatically extract imbalance-related performance problems from event traces generated by MPI programs and intuitively guide the performance analyst to inefficiencies whose optimization promise the highest benefit. The first method, the delay analysis, identifies the root causes of wait states. A delay occurs when a program activity needs more time on one process than on another, which leads to the formation of wait states at a subsequent synchronization point. Wait states, which are intervals through which a process is idle while waiting for the delayed process, are the primary symptom of load imbalance in parallel programs. While wait states themselves are easy to detect, the potentially large temporal and spatial distance between wait states and the delays causing them complicates the identification of wait-state root causes. The delay analysis closes this gap, accounting for both short-term and long-term effects. To this end, the delay analysis comprises two contributions of this dissertation: (1) a cost model and terminology to describe the severity of a delay in terms of the overall waiting time it causes; and (2) a scalable algorithm to identify the locations of delays and determine their cost. The second new analysis method is based on the detection of the critical path. In contrast to the delay analysis, which characterizes the formation of wait states, this critical-path analysis determines the effect of imbalance on program runtime. The critical path is the longest execution path in a parallel program without wait states: optimizing an activity on the critical path will reduce the programs run time. Comparing the duration of activities on the critical path with their duration on each process yields a set of novel, compact performance indicators. These indicators allow users to evaluate load balance, identify performance bottlenecks, and determine the performance impact of load imbalance at first glance by providing an intuitive understanding of complex performance phenomena. Unlike existing statistics-based load balance metrics, these indicators are applicable to both SPMD and MPMD-style programs. Both analysis methods leverage the scalable event-trace analysis technique employed by the Scalasca toolset: by replaying event traces in parallel, the bottleneck search algorithms can harness the distributed memory and computational resources of the target system for the analysis, allowing them to process even large-scale program runs. The scalability and performance insight that the novel analysis approaches provide are demonstrated by evaluating a variety of real-world HPC codes in configurations with up to 262,144 processor cores