Abstracts
Performance measurement of modern Fortran MPI applications with Score-P
Gregor Corbin, Jülich Supercomputing Centre (JSC)
Version 3.1 of the Message-Passing Interface (MPI) standard, released in 2015, introduced a new set of language bindings for Fortran 2008. By making use of modern language features and the enhanced interoperability with C, these bindings finally offered a type-safe and standard-conforming way to call MPI from Fortran. This highly recommended `use mpi_f08` language binding has since been widely adopted among developers of modern Fortran applications. However, tool support for the F08 bindings is still lacking almost a decade later, forcing users to fall back to the less safe and less convenient interfaces. Full support for the F08 bindings was added to the performance measurement infrastructure Score-P by implementing MPI wrappers in Fortran. The wrappers cover the latest MPI standard version 4.1 in its entirety, matching the features of the C wrappers. By implementing the wrappers in modern Fortran, we can provide full support for MPI procedures passing attributes, info objects, or callbacks. The implementation is regularly tested with the MPICH test suite. The new F08 wrappers have already been used successfully to generate performance measurements for two fluid dynamics simulation codes: Neko, a spectral finite-element code derived from Nek5000, and EPIC (Elliptical Parcel-In-Cell). In this work, we additionally present our design considerations and sketch out the implementation, discussing the challenges we faced in the process. The key component of the implementation is a code generator that produces approximately 50k lines of MPI wrapper code to be used by Score-P, relying on the Python pympistandard module for programmatic access to the data extracted from the MPI Standard. Although the current generator is tailored to the needs of Score-P, future versions could be made generic to offer the functionality to other tool developers.
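As a rough illustration of the generator approach, the sketch below iterates over the procedures exposed by the pympistandard module and emits a skeleton F08 wrapper for each. The attribute names used here (`PROCEDURES`, `express.f08`, `parameters`) and the wrapper template are illustrative assumptions, not the actual Score-P generator.

```python
# Illustrative sketch of a wrapper generator; NOT the Score-P implementation.
# The pympistandard attribute names below are assumptions for illustration.
import pympistandard as std

std.use_api_version(1)  # load the data extracted from the MPI Standard


def emit_f08_wrapper(proc) -> str:
    """Emit a skeleton F08 wrapper that records enter/exit events."""
    name = proc.name                 # procedure name, e.g. "mpi_send"
    f08 = proc.express.f08           # F08 expression of the procedure (assumed attribute)
    args = ", ".join(p.name for p in f08.parameters)
    return (
        f"subroutine {name}_f08({args})\n"
        f"  ! record enter event, call the PMPI routine via mpi_f08, record exit event\n"
        f"end subroutine {name}_f08\n\n"
    )


with open("scorep_mpi_f08_wrappers.F90", "w") as out:
    for proc in std.PROCEDURES.values():
        out.write(emit_f08_wrapper(proc))
```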
Leveraging Machine-readable MPI API Specification for Tool Development
Felix Tomski, Joachim Jenke, Simon Schwitanski, RWTH Aachen University
The number of functions in the MPI standard grows with every new release, forcing tool developers to keep up with the latest changes. Often, existing tool functionality can be reused for new MPI functions, e.g., when new flavors of existing functions are added to the standard. Moreover, additional knowledge about MPI functions, such as whether they are local or collective, is often important for tool developers. While the MPI standard includes such an overview, there is no publicly available machine-readable version that could be used to automate tool development. The MPI Forum is internally working on a JSON database of the function API specification, which is used to generate the code parts of the standard document. This work investigates the usefulness of such a machine-readable specification for the MPI correctness checking tool MUST. We leverage the API specification to enhance MUST in two ways: first, to find inconsistencies in the mapping of correctness checks to supported MPI functions; second, to automatically apply appropriate correctness checks to functions introduced in new MPI standard releases. Additionally, we discuss the limitations of the machine-readable file in its current form and propose extending the API specification with additional function attributes. Such attributes include marking a function as a constructor/destructor, (non)blocking, (non)local, (non)collective, or a communication function. While we only showcase the usefulness of an enriched, machine-readable MPI API specification for the development of MUST, other MPI tools, e.g., for performance analysis, should be able to profit likewise.
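To make the idea concrete, the following sketch shows how a machine-readable function database could be used to spot gaps in a tool's check mapping. The file name and JSON schema (`procedures`, `name`, `attributes`) are invented for illustration and do not reflect the MPI Forum's internal format.

```python
# Hypothetical consistency check: which MPI functions lack a mapped check?
# The JSON schema (procedures/name/attributes fields) is an assumed, illustrative format.
import json

with open("mpi_api.json") as f:          # hypothetical export of the function database
    api = json.load(f)

# Tool-side mapping of MPI functions to correctness checks (illustrative excerpt)
check_map = {
    "MPI_Send": ["buffer_check", "datatype_check"],
    "MPI_Isend": ["buffer_check", "datatype_check", "request_check"],
}

for proc in api["procedures"]:
    name = proc["name"]
    attrs = set(proc.get("attributes", []))        # e.g. {"nonblocking", "local"}
    if name not in check_map:
        print(f"no checks mapped for {name} (attributes: {sorted(attrs)})")
    elif "nonblocking" in attrs and "request_check" not in check_map[name]:
        print(f"{name} is nonblocking but lacks a request completion check")
```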
TALP-Pages: An easy-to-integrate continuous performance monitoring framework
Valentin Seitz, Jordy Trilaksono, Marta Garcia-Gasulla, Barcelona Supercomputing Center + Max Planck Institute for Plasma Physics
Ensuring good performance is a key aspect in the development of codes that target HPC machines. As these codes are often under active development, the necessity to detect performance degradation early in the development process becomes apparent. Additionally, having meaningful insight into application behavior tightly coupled to the development workflow is essential. In this paper, we introduce TALP-Pages, an easy-to-integrate framework that enables developers to get fast and in-repository feedback about their code performance using the well-established POP performance metrics. The framework relies on TALP [1], which enables the collection of the metrics either without code changes or by annotating the code. An additional layer then generates static websites to visualize the collected metrics and their evolution over different versions of the software. These self-contained HTML pages can either be viewed standalone or integrated easily into CI/CD solutions like GitLab Pages. Compared to other approaches, e.g., [2, 3], it does not require the developers to maintain additional infrastructure, but relies solely on existing CI/CD mechanisms. To showcase the ease of use and effectiveness of this approach, we extend the current GitLab CI/CD setup of GENE-X [4] with only minimal code changes required.
META: A Toolkit for Template Metaprogramming Performance Analysis
Christopher Taylor, Tactical Computing Labs, LLC
HPC developers often implement domain-specific languages and libraries to improve the productivity of scientists. HPC software implementing these languages and libraries employs advanced C++ language techniques such as template metaprogramming. HPC-oriented libraries exercising template metaprogramming techniques permit a substantial level of customization and portability across multiple systems. Although applications developed using these HPC libraries can yield high performance, the libraries may inadvertently introduce performance portability regressions on different systems. Performance portability regressions may be discovered when an HPC application is migrated from current to previous or next-generation systems. Portability-related regressions caused by the application of advanced C++ techniques can be challenging for the compiler to identify and correct. Identifying these specific types of regressions requires scientists to have expertise in their compiler, in advanced C++ template programming techniques, and in the underlying hardware. Without the requisite knowledge or background, performance portability regressions may remain undetected and unresolved. META is a portable static analysis infrastructure that addresses these challenges by extending the LLVM compiler toolchain to detect performance regressions, deadlocks, and race conditions, and to offer concrete suggestions to improve performance. META can offer possible modifications for applications written with C++ domain-specific languages and parallel template metaprogramming libraries. META currently supports RAJA, Kokkos, and a mix of both libraries with STL data-parallel control structures (mutexes, locks, condition variables, etc.).
Score-P and OMPT: Smoothing the bumpy road to OpenMP performance measurement
Jan André Reuter, Christian Feld, Bernd Mohr, Jülich Supercomputing Centre (JSC)
The OpenMP API is a widely used interface for high-level parallel programming in C, C++, and Fortran. Initially introduced in 1997, it now targets three basic processor building blocks: CPUs, SIMD vector units, and accelerators. With broad adoption in the HPC community and wide support from compiler vendors, OpenMP grew into a key component for leveraging node-level parallelism in applications and frameworks. With this, a need for OpenMP-aware performance measurement and analysis tools arose. Version 5.0 of the OpenMP specification introduced the OpenMP Tools Interface (OMPT), providing means to collect precise information about an application's use of OpenMP directives and lock routines. Although backed by a detailed specification, understanding and correctly handling the CPU execution model event sequence dispatched from various vendors' runtimes requires detailed analysis of the events, their parameters, and the executing threads. To facilitate this analysis, we developed a freely available OMPT tool that allows dumping execution model events and corresponding metadata for post-mortem inspection. Analyzing the output of this tool applied to the official OpenMP examples and handwritten smoke tests enabled us to implement an OMPT tool for the performance measurement infrastructure Score-P, replacing the long-established but feature-incomplete source-to-source OpenMP instrumenter OPARI2. Both OMPT tools are regularly tested against the aforementioned OpenMP examples and smoke tests. As vendors take liberties in interpreting the OMPT specification, various checks were developed to detect deviations. In Score-P, deviations are classified as fatal, disengageable, or remediable. Based on feedback given to the vendors, several of the deviations are no longer a concern. Throughout the development of OMPT itself, the overhead introduced in the OpenMP runtimes was always a concern. To assess this overhead in various contemporary runtimes, we used the EPCC and SPEC OpenMP benchmark suites with OMPT disabled (if possible), with a dummy tool, and with the Score-P OMPT tool attached.
OMPT Support in ROCm: Past, Present & Future
Michael Halkenhäuser, Dhruva Chakrabarti, Jan-Patrick Lehr, Ron Lieberman, Advanced Micro Devices, Inc.
In this experience report, we cover the ongoing development of OMPT support for OpenMP target offloading in ROCm. While it has been available since ROCm v5, it has lived through a number of changes, regressions, and improvements. Since then, it has become vital for tool implementers both inside and outside of AMD. In our presentation, we provide insight into the development history and some of the challenges we went through. Additionally, we cover recent developments and ideas for the future. We first present some history, from the initial patches enabling OMPT callback support to how they were incorporated into ROCm. Subsequently, we move on to the upstreaming effort for OMPT callback support of OpenMP target offloading and the challenges we faced. These mostly revolve around changes requested in code review during upstreaming and the engineering necessary during the subsequent downstreaming. A critical point is to highlight a few bugs that were introduced during this process. One of the reasons for the required engineering is the currently downstream-only functionality of device-side tracing and how it interacts with the callback support. Here, we present the recent enablement of device-side tracing without falling back to synchronous execution. Furthermore, we present an overview of our efforts to enable better test coverage of the OMPT implementation and tool integration. Finally, we outline our plans for upstreaming the device-side tracing implementation into upstream LLVM and close with ideas for information that could be exposed via an OMPT “native” data type, to open the discussion.
OTF-CPT: Application Insights Gained from Real-time Critical Path Analysis
Ben Thärigen, Joachim Jenke, Tobias Dollenbacher, Fabian Orland, RWTH Aachen University
The critical path of an application is an essential indicator in performance analysis, as it is closely correlated with the application's total execution time. A lot of research has been done on identifying the critical path, or a profile of it, post-mortem and on discovering performance issues with it. Due to the definition of the critical path itself, an efficient online algorithm that determines the whole critical path is unlikely to exist. However, recent work showcased that collecting less information and focusing on, e.g., only the critical path length can be done on the fly without significant runtime overhead. This limited information can still yield significant performance insights, as different critical path metrics are crucial in determining fundamental performance factors. In the context of this work, the on-the-fly critical-path tool (OTF-CPT) was born, which collects information about the critical path during the execution of an application without significant runtime overhead. In this workshop contribution, we show how OTF-CPT can be used to gain new insights into the scaling behavior of parallel applications.
Performance Analysis and Optimization of CFD Applications
Martin Clemens, Jonathan Fenske, Daniel Molka, Hirav Patel, Michael Wagner, Ronny Tschüter, German Aerospace Center (DLR)
Computational Fluid Dynamics (CFD) simulations are a valuable tool in research and engineering. The increasing demand for compute performance and scalability of high-fidelity CFD simulations requires continuous performance analysis and optimization as part of the CFD software development cycle. For this task, performance tools provide significant support for software developers to investigate and understand application behavior. This work evaluates the performance of different CFD codes and provides an overview of aspects that are crucial for highly scalable High Performance Computing (HPC) applications. It applies performance modelling to a code for the computation of the three-dimensional unsteady flow in multi-stage compressors and turbine components to estimate its scalability. In addition, it presents scalability studies of a next-generation CFD software for aerodynamic simulations of fully equipped aircraft, demonstrating that mesh partitioning is a performance-critical part of CFD simulations. Finally, it compares the results of weak and strong scaling experiments on different compute architectures and emphasises the benefits of offloading computationally intensive code sections to GPUs.
Performance and Scalability of the CODA CFD Software on the AMD Naples and Rome Architectures
Michael Wagner, German Aerospace Center (DLR)
Computational Fluid Dynamics (CFD) simulations are playing an increasingly important role in aircraft design. They provide detailed insight into the aerodynamic behavior of components and help reduce costs and development time. CODA is a next-generation CFD software for aerodynamic simulations of fully equipped aircraft. Developed by the German Aerospace Center (DLR), the French Aerospace Laboratory (ONERA) and Airbus, the CODA CFD software is one of the key applications represented in the European Center of Excellence for Engineering Applications (Excellerat). This work evaluates the performance of the CODA CFD software on the two large HPC production systems at DLR based on the AMD Naples and Rome architectures with the NASA Common Research Model. The evaluation includes a comparison of compute performance, an assessment of strong and weak scaling behavior on the largest available partitions of the two production systems, and a discussion of performance and scalability using various performance analysis tools.
Exploring Multi-threaded Communication Behavior of a Large-Scale CFD Solver with Vampir
Denis Hünich, Ronny Tschüter, Bert Wesarg, TU Dresden, CIDS/ZIH + German Aerospace Center (DLR)
Analyzing the communication behavior of distributed parallel applications is a key task of parallel performance tools. Over the years, the MPI standard has extended the possibilities to leverage threads within individual processes. The current MPI standard defines several levels of thread support, with MPI_THREAD_MULTIPLE being the least restrictive level, at which any thread can issue MPI calls at any time. This feature gives applications maximum flexibility in their communication patterns and is used in production codes, but it poses major challenges for performance analysis frameworks. Irrespective of the challenges associated with trace data recording, this work focuses on the challenges of a scalable performance analysis: reading MPI events across multiple threads, aggregating information from distributed communication events, and deriving correct communication metrics. In this work, we present enhancements to the distributed analysis engine of Vampir – a tool to visualize and analyze MPI communication behavior – to support the investigation of MPI calls at any thread level. We demonstrate the applicability of our work with a performance evaluation of a large-scale CFD solver, in particular investigating its MPI_THREAD_MULTIPLE-based communication patterns, which was not possible before. Furthermore, we compare the induced overhead and load time with those of previous Vampir versions.
Enable I/O Metadata Analysis in Score-P and OTF2
Sebastian Oeste, Radita Liem, Bert Wesarg, TU Dresden, CIDS/ZIH + RWTH Aachen University
The I/O workload on supercomputers impacts the overall performance of the system by challenging the distributed file system. Many modern parallel and distributed file systems separate data and metadata. The data throughput can easily scale with an increasing number of data servers. Metadata servers, however, are more likely to become a point of contention for latency-sensitive work. In particular, workloads with a small I/O volume but a large number of files are limited by their metadata performance. To guide the development of programs and file systems, a deep understanding of metadata operations is crucial. However, current analysis tools tend to focus on data operations such as read and write, while metadata acquisition is not or only partially supported. In this paper, we extend the widely known Score-P analysis framework with the missing metadata operations. We present a first time-to-solution analysis of metadata operations and show their impact on the overall I/O time of an application. Furthermore, we discuss the overhead of tracing and the possibilities of including more I/O semantics in the trace with additional OTF2 records. The resulting findings form the basis for a holistic I/O analysis and in-depth metadata introspection.
Blackheap: End-to-End I/O Performance Modeling and Classification for High-Performance Computing
Lars Quentin, GWDG, University of Goettingen
As High-Performance Computing (HPC) increasingly shifts from traditional compute-intensive to more data-intensive workloads, optimizing I/O operations becomes essential to avoid bottlenecks. While measuring the I/O runtime is straightforward, evaluating the performance quality of I/O accesses is very complex due to the intricate nature of I/O systems, especially in a multi-user environment utilizing large parallel storage clusters. Theoretical analysis is almost impossible, and end users often lack the sophisticated knowledge and permissions to properly assess the performance of their compute tasks. This paper presents Blackheap, an automated, zero-configuration tool for benchmarking, modeling, and classifying I/O access patterns in HPC environments. Blackheap uses simple I/O characteristics such as the type (read, write), the offset relative to the last operation, the size, and the time to group accesses into different classes. It offers an automated workflow that benchmarks different access patterns (sequential, random, cached, reversed), builds performance models using Kernel Density Estimation (KDE) and linear regression, and provides tools to measure and classify any third-party program without requiring software modifications or recompilation. Blackheap's models can be applied in two ways: first, via `LD_PRELOAD`, which overrides the POSIX functions responsible for I/O and logs the results directly to disk for manual analysis; second, via `iofs`, a FUSE file system that automatically classifies accesses and streams the metrics to monitoring systems such as Grafana for centralized real-time visualization. Blackheap's accuracy and correctness were verified by re-running benchmarks and confirming consistent classifications. It is currently being further evaluated against the parallel IOR benchmark, which is also used for the IO500 benchmarks. Blackheap offers a black-box, end-to-end tool for I/O performance modeling and classification in HPC environments.
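The sketch below illustrates, in strongly simplified form, the kind of per-class modeling and classification described above: it fits a linear latency model for each synthetic access class and assigns a new measurement to the closest model. It uses only NumPy and invented numbers; it is not Blackheap's implementation.

```python
# Simplified illustration of access-pattern classification via per-class
# linear models (synthetic data and classes; not Blackheap's code).
import numpy as np

rng = np.random.default_rng(0)
sizes = rng.integers(4_096, 4_194_304, size=200)           # bytes per access

# Synthetic training data: latency = base + size / bandwidth, per class
classes = {
    "cached read":     (5e-6, 8e9),     # (base latency [s], bandwidth [B/s])
    "sequential read": (50e-6, 2e9),
    "random read":     (200e-6, 0.5e9),
}
models = {}
for name, (base, bw) in classes.items():
    times = base + sizes / bw + rng.normal(0, 1e-6, size=sizes.size)
    slope, intercept = np.polyfit(sizes, times, 1)           # linear regression per class
    models[name] = (slope, intercept)


def classify(size: int, time: float) -> str:
    """Assign a measured access to the class whose model predicts it best."""
    return min(models, key=lambda n: abs(models[n][0] * size + models[n][1] - time))


# A 1 MiB access taking roughly size/2e9 + 60us should land in "sequential read".
print(classify(1_048_576, 1_048_576 / 2e9 + 60e-6))
```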
Augmenting HPC I/O Performance Analysis with Detailed Block Layer Insights
Christian von Elm, Sebastian Oeste, Mario Bielert, Thomas Ilsche, TU Dresden, CIDS/ZIH
I/O performance is crucial for many contemporary applications on modern supercomputers, where operations involve a deeply layered storage stack. This stack includes the application layer, I/O libraries, file system interfaces, and the block layer, followed by device drivers and physical devices. Effective I/O performance depends not only on the capability of the devices but also on how well the different layers interact. Consequently, to achieve the best possible application performance, it is necessary to examine the behavior of I/O at different levels of the I/O stack and to attribute complex performance characteristics to specific layers of the storage stack. Current HPC analysis tools analyze I/O behavior but mostly focus on I/O libraries such as MPI-IO or file system interfaces like POSIX I/O. To provide a more comprehensive analysis and a unique perspective on HPC I/O performance, we include the block layer in the node-level monitoring tool lo2s. We implement a non-intrusive method of block layer event collection using kernel tracepoints. Interactive graphical visualization of the collected information is supported by Vampir. Leveraging this approach, we highlight differences in how file systems map POSIX calls to block I/O. The resulting holistic insight provides a strong foundation for optimizing the performance of parallel I/O.
Enhancing Performance Analysis of Parallel Tools for HPC Applications on Fugaku Supercomputer
Samar A. Aseeri, Judit Gimenez, Sameer S. Shende, Benson K. Muite, David E. Keyes, KAUST Extreme Computing Research Center, Saudi Arabia + Barcelona Supercomputing Center + University of Oregon + Kichakato Kizito, Kenya
In the field of parallel computing, the efficacy of performance profiling tools is paramount for maximizing application efficiency. This study conducts measurements and comparisons of TAU and Extrae on the Fugaku supercomputer, focusing on their impact on FFT-based solvers for the Klein-Gordon equation. The research evaluates tool overhead, memory usage, and communication modes, comparing open-source tools like TAU and Extrae with vendor-provided options such as FIPP and FAPP. Leveraging the torus network on Fugaku enables isolated communication for reliable measurements, highlighting minimal profiling runtime overheads. This study provides valuable insights for enhancing performance analysis in HPC applications and optimizing parallel programming environments.
The Road to PC Sampling on AMD Instinct™ MI200 GPU Series
Vladimir Indic, Timour Paltashev, Advanced Micro Devices, Inc
The rise of AI has led to the development of large language models (LLMs), which require substantial computing power for training and inference. Modern supercomputers used for training LLMs and running HPC workloads employ many GPUs to satisfy this demand for computational power. Consequently, a need for high-quality GPU profiling tools arose. The first exascale supercomputer, Frontier, harnesses most of its computational power from AMD MI250X GPUs. Most state-of-the-art profiling tools provide only coarse-grained metrics for determining kernel performance on AMD GPUs by employing host-side tracing and performance counter sampling. The latter approach often requires kernel serialization to provide meaningful information for each kernel independently. Understanding the performance at the instruction level is possible with advanced thread tracing, with the restriction of monitoring a single SIMD compute unit per shader engine at a time. This technical paper presents the design of a program counter (PC) sampling method for the MI2xx family that overcomes the drawbacks of existing profiling methods on AMD GPUs. PC sampling is commonly used on other vendors' architectures to statistically approximate kernel execution. It involves periodically interrupting executing kernels and creating an execution profile represented as a histogram of program counters sampled over time. We outline the PC sampling method's design and demonstrate the API implemented in AMD's profiling library, ROCProfiler-SDK. We resolved some of the implementation shortcomings of other vendors by enabling the monitoring of concurrent kernels offloaded by different processes, as is typical for MPI HPC applications. Furthermore, we enabled the distinction of execution costs among multiple invocations of the same kernel, allowing application developers to understand how invocation contexts impact kernel performance. We conclude the paper by highlighting the practical usage of the ROCProfiler-SDK PC sampling API for profiling real AI and HPC workloads.
A Runtime-Adaptable Instrumentation Back-End for Score-P
Sebastian Kreutzer, Paul Adelmann, Christian Bischof, TU Darmstadt
Score-P provides compiler plugins for GCC and Clang that perform static instrumentation of the target code, enabling detailed function-level profiling and tracing. Typically, the instrumentation configuration needs to be filtered to reduce the inherent measurement overhead to an acceptable level. In Score-P, filtering is currently possible both at compile time and at runtime. However, recompilation is time-consuming and typically not feasible for large applications. Runtime filtering, on the other hand, while more flexible, still imparts measurable overheads. As an alternative instrumentation approach, Clang's XRay feature [1] enables runtime-adaptable instrumentation without recompilation. XRay statically inserts placeholder instruction sleds into possible instrumentation points, which can be selectively patched at runtime. As a result, instrumentation filters can be swapped out dynamically, with near-zero overhead for inactive sleds. Compared to purely dynamic instrumentation methods, XRay introduces less complexity by integrating into the static compilation toolchain. In this work, we present a new XRay-based instrumentation adapter for Score-P. We evaluate the runtime of the Score-P profiling and tracing modes with XRay, compared to the static instrumentation inserted by the existing LLVM plugin. We also incorporate performance measurements of a recent extension to XRay for shared library instrumentation [2]. Early results on proxy applications show measurement overheads comparable to static instrumentation for fully instrumented runs, and reduced overhead for filtered runs.
Always-on Application Introspection for Large HPC Systems: Benefits and Progress
Amir Raoofy, Josef Weidendorfer, Michael Ott, Carla Guillen Carias, LRZ Garching
Having insight into the characteristics of the applications running on their HPC systems is a huge benefit for HPC centres. Introspection of user applications via always-on, system-wide monitoring tools can provide this insight without bothering the users. At the Leibniz Supercomputing Centre, we integrate a lightweight sampling method into our in-house monitoring tool DCDB. This scheme leverages eBPF (extended Berkeley Packet Filter) available in modern Linux kernels. In this paper, we showcase the benefits of enabling introspection in DCDB and discuss the latest developments and progress.
Performance Analysis of Machine Learning Models Using Vampir: A Deep Dive into I/O Dynamics
Zoya Masih, GWDG, University of Goettingen
Despite the abundance of research in the AI field, there is a noticeable gap in resources specifically addressing performance analysis of I/O operations when an AI workload is utilizing a system's resources. This talk intends to fill that gap by presenting an approach to observing the I/O behavior of ML models. We will explore the performance characteristics of machine learning models with a focus on I/O dynamics, utilizing the powerful Vampir performance analysis tool. This session aims to provide participants with a comprehensive understanding of what transpires during the various stages of two ML model executions. Participants will be guided through a detailed methodology on how to use Vampir to monitor and analyze I/O behavior effectively, uncover potential bottlenecks, and optimize performance. The session will delve into probable scenarios encountered during model training and data processing, with a focus on the deployment of NVIDIA's Data Loading Library (DALI) for handling large-scale data inputs.
Insights into Performance of Machine Learning Optimizers Using Score-P
Jack Ogaja, Eliah Windolph, GWDG, University of Goettingen
We demonstrate how to use Score-P to instrument and profile machine learning applications. From the generated profiles and traces, we aim to gain insights into the memory footprints and convergence speed of first-order optimization algorithms currently used in training machine learning models. The resulting metrics are used to characterize the performance of the selected algorithms and of modern computer architectures.
Accelerating the FlowSimulator: Tracing and Profiling of Python Toolchains for Industry-Grade Simulations
Marco Cristofaro, Johannes Wendler, Lars Reimer, Immo Huismann, German Aerospace Center (DLR)
High-performance computing is increasingly important in the aeronautical industry, enabling part of the certification process to be done via simulation. Current aviation regulations allow the use of simulation techniques to replace costly and potentially hazardous tests for aircraft and helicopters. But for simulations to be accepted, their accuracy must match that of physical testing. High-fidelity simulation models, required for the needed accuracy, involve very fine meshes, making high-performance computing essential. However, as hardware complexity grows, time-to-solution may not meet expectations. Complex toolchains can experience unexpected bottlenecks even in simple simulation blocks, making tracing and profiling tools crucial for identifying issues. In this work, we investigate a complex simulation toolchain to compute the static aeroelastic equilibrium of a wing [1]. On the outer level, a Python control layer manages individual solvers for Computational Fluid Dynamics, Computational Structural Mechanics, mesh deformation, and interpolation methods. These solver components, in turn, are implemented using a Python layer wrapping the underlying C++ libraries. We use Score-P to extract traces of simulation runs and analyze them with Vampir and CUBE. Results show the actual relation between the runtimes of individual simulation blocks and the whole runtime for each MPI process. Critical components requiring further investigation are then identified for future detailed analyses. Last but not least, the contribution discusses the usability of the performance tools in the context of mixed C++/Python workflows.
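For mixed C++/Python toolchains of this kind, one typical way to obtain block-level timings is to annotate the Python control layer with user regions from the Score-P Python bindings. The following is a minimal sketch under that assumption; the solver objects and method names are placeholders and do not correspond to the actual FlowSimulator API.

```python
# Minimal sketch: annotating a Python-driven coupling loop with Score-P user
# regions (run e.g. as: mpirun -n 4 python -m scorep run_case.py).
# The solver objects (cfd, csm, interp) and their methods are placeholders.
import scorep.user


def coupling_loop(cfd, csm, interp, max_iterations=10):
    """Placeholder aeroelastic coupling loop with one user region per block."""
    for _ in range(max_iterations):
        with scorep.user.region("cfd_solve"):
            loads = cfd.solve()                          # C++ CFD solver behind a Python wrapper
        with scorep.user.region("load_interpolation"):
            structural_loads = interp.map(loads)         # fluid -> structure interpolation
        with scorep.user.region("csm_solve"):
            displacements = csm.solve(structural_loads)  # structural solver
        with scorep.user.region("mesh_deformation"):
            cfd.deform_mesh(interp.map_back(displacements))
```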
Integrating performance analysis in JupyterLab for OpenMP and MPI
Klaus Nölp, Lena Oden, FernUniversität in Hagen
Jupyter Notebooks are used for programming in Python or R for machine learning and data analysis. However, with the xeus-cling C++ kernel, they can also be used for parallel programming with MPI and OpenMP. This environment is very suitable for teaching, as students can quickly test and try out functions. Furthermore, the interactive use of the environment is also well suited for the initial testing and prototyping of new algorithms. Despite that, the interpretation of C/C++ code is not suitable for analyzing performance. For this reason, we have developed JUMPED (Jupyter Measurement and Performance Environment for Development), which offers options for performance analysis within JupyterLab and compiles and directly executes the previously interpreted code. On the one hand, setting simple markers in the code enables the automated execution of benchmarks and scalability tests directly in JupyterLab. On the other hand, the integration of performance tools, such as Score-P in our example, allows a deeper analysis of parallel performance. The results are visualized directly in JupyterLab and can thus be displayed together with the code. A case study evaluating the interaction between compression and MPI will demonstrate how JUMPED can be used.
pytest-isolate-mpi: Towards modern testing of MPI-parallel HPC applications
Sebastian Gottfried, Jordan Lavialle, Immo Huismann, German Aerospace Center (DLR)
Simulations are getting more complex: while simulating with a single code sufficed at the start of the century, modern simulations are multi-disciplinary. For instance, aircraft are now being devised using multi-disciplinary optimization coupling Computational Fluid Dynamics codes with Computational Structural Mechanics codes. The resulting toolchains often couple well-optimized C/C++/Fortran codes using Python as glue code, as done, e.g., in the FlowSimulator environment [1]. While the compute kernels of each discipline are typically tested with frameworks suited to the language, their Python glue code requires testing as well. However, the complexity inherited from each code needs to be taken into account: deadlocks, segfaults, and MPI_Abort are typically not handled by unit testing frameworks, either fully precluding their usage or requiring the user to rely on exit codes of separate run scripts. This paper proposes an extension to pytest that handles these cases gracefully: pytest-isolate-mpi. By running each test in separate processes, deadlocks, segfaults, and MPI_Abort can be handled gracefully, while keeping the benefits of pytest, e.g., idiomatic test syntax and advanced test parametrization. Usage examples are presented, corner cases discussed, and ongoing extensions shown.
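A hypothetical usage sketch is shown below. It assumes the plugin exposes a pytest marker that requests a number of MPI ranks per test; the exact marker name and signature should be taken from the pytest-isolate-mpi documentation, not from this sketch.

```python
# Illustrative test module; the @pytest.mark.mpi(ranks=...) marker is an
# assumed usage pattern -- consult the pytest-isolate-mpi docs for the real API.
import pytest
from mpi4py import MPI


@pytest.mark.mpi(ranks=2)
def test_allreduce_sum():
    # Runs in an isolated 2-rank MPI job spawned by the plugin.
    comm = MPI.COMM_WORLD
    total = comm.allreduce(1, op=MPI.SUM)
    assert total == comm.Get_size()


@pytest.mark.mpi(ranks=4)
def test_failure_is_contained():
    # Even if a rank deadlocks, segfaults, or calls MPI_Abort, only this
    # isolated subprocess fails instead of tearing down the pytest session.
    comm = MPI.COMM_WORLD
    assert comm.Get_size() == 4
```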