Distributed and Heterogeneous Programming in C++ (DHPCC++22)
Technology trends require software/hardware co-design for HPC systems. Open-standard software programming models enable this co-design and allow us to support initiatives such as the European Processor Initiative (EPI), which supports RISC-V® processors as accelerators. Indeed, the need for distributed and heterogeneous programming models is so urgent that we now have many examples to draw from. By focusing on open ISO languages such as C++ with the addition of heterogeneous/distributed offloads to support the newest accelerator hardware, this workshop aims to isolate and unify the learnings from the many different models for the Performance, Portability, and Productivity equation.
This will be the 4th DHPCC++ workshop (previously held 2017-19) with a focus on heterogeneous programming models for C and C++, covering all the programming models that have been designed to support heterogeneous programming in C++. It is held as part of Euro-Par 22 in Glasgow, Scotland, 22-26 August 2022.
- Workshop Date: 23rd August 2022 9:00am - 5:30pm
Order of Events
|9:15am BST||Keynote: The Future of Heterogeneous Computing with C++ and SYCL||Michael Wong, ISO C++, SYCL Chairperson, Distinguished Engineer, Codeplay Software||-||Download|
|10:15am BST||Bringing Performance Portability to the Exascale Era with C++ and SYCL||Kevin Harms, Argonne National Laboratory||-||Download|
|11:30am BST||Performance Portability Evaluation: Non-negative Matrix Factorization as a case study||Youssef Faqir-Rhazoui, Carlos Garcia Sanchez, Francisco Tirado, Universidad de Madrid||The SYCL standard has been released with the aim of increasing code portability in heterogeneous environments. For its part, Intel has launched the oneAPI toolkit, which includes the Data Parallel C++ language, the Intel implementation of SYCL. SYCL is designed to use a single source code to target multiple accelerators, such as multi-core CPUs, GPUs, or even FPGAs. Additionally, the C/C++ oneAPI compiler also supports OpenMP, which also allows targeting CPU and GPU devices. In this paper, a performance evaluation of SYCL and OpenMP is carried out using the well-known Non-negative Matrix Factorization (NMF) algorithm. Three different NMF implementations (baseline, SYCL and OpenMP) are developed to analyze the speedups on both CPU and GPU. Experimental results show that while on CPU both programming models report almost the same performance, on GPUs, SYCL slightly outperforms its OpenMP counterpart.||Download|
|2:00pm BST||Providing SYCL backend for embedded Xilinx FPGA using ComputeCpp||Pietro Ghiglio, Anton Mitkov, Uwe Dolinsky, Kumudha Narasimhan, Mehdi Goli, Codeplay Software||Modern Machine Learning (ML) and Artificial Intelligence (AI), in particular Deep Learning, require efficient linear algebra implementations and accelerator platforms. Reconfigurable technologies are a promising direction towards having specialized AI and ML hardware due to performance, low power consumption and their re-targetability. They typically require far less power and can deliver nearly the same energy efficiency as GPU devices. However, a lack of high-level parallel framework support on FPGAs remains a challenge to utilizing them effectively. Portability is a key requirement for fast and productive reuse and deployment of the same software across a wide range of different types of accelerators. SYCL, as an open-standard, high-level parallel programming model, provides this portability not only at the API level but also at the compiler level, which helps improve performance portability when targeting the same code to radically different hardware. Enabling SYCL on FPGA widens the support of existing SYCL software and tools ecosystems to target FPGA-based devices. FPGAs are intrinsically very different from other platforms (e.g., GPUs and multicore CPUs): while the parallelism in CPUs and GPUs is exposed by means of threads executed in parallel, leading to task-level parallelism, FPGAs expose a much finer level of parallelism. Therefore, kernels written specifically for FPGAs can differ quite drastically from kernels written with CPUs or GPUs in mind. In order to achieve better performance portability of SYCL applications on CPUs, GPUs and FPGAs, better (and target-specific) code transformations need to be performed. In this work, we present a SYCL implementation that is able to target existing open-source SYCL ecosystem code on Xilinx FPGAs, particularly embedded platforms.
We describe our integration of Xilinx’s Vitis™ tools in the ComputeCpp device compiler. This integration allows us to exploit the usual device code extraction performed by the SYCL implementation and convert the extracted code into the bitstream format, which is then used to configure the FPGA. The bitstream file is bundled in the integration header, replacing the SPIR/SPIR-V code that the header usually contains and enabling the open source XRT runtime to load the bitstream. While this integration is similar to TriSYCL, it benefits from ComputeCpp’s tool flow and optimisations to target embedded and resource-constrained devices.||Download|
|2:40pm BST||MDSPAN: A Deep Dive Spanning Kokkos, C++ & SYCL||Nevin Liber, Argonne National Laboratory||This talk is a deep dive into the history behind MDSPAN (its roots being in Kokkos::View), the C++ standardization effort behind it (current status, various tradeoffs made over time, and language changes to help support it) and how SYCL is looking to leverage it in the future. This talk will cover Kokkos::View, the ill-fated array_view proposal for C++, how the C++ committee morphed Kokkos::View into MDSPAN, C++ language changes to improve MDSPAN, the MDSPAN interface, MDARRAY, Kokkos using MDSPAN, SYCL-Next, and MDSPAN and removing three-dimensional limits.||Download|
|4:00pm BST||Portable Uintah framework for heterogeneous, asynchronous many-task runtime systems based on SYCL||Abhishek Bagusetty, John Holmen and Martin Berzins, ANL, ORNL, University of Utah||The Uintah computational framework (UCF), an asynchronous many-task runtime system, has evolved over time to adopt the MPI+X hybrid parallelism approach that has shown promise for distributed, heterogeneous CPU-GPU computing architectures. The UCF was developed to provide an environment for solving fluid-structure interaction problems on structured adaptive grids on large-scale, long-running, data-intensive problems with a novel asynchronous task-based approach with fully automated load balancing. Our most recent work involved using ISO C++ based single-source, cross-platform abstractions such as SYCL for our MPI+X model. The preliminary SYCL implementation of components in the UCF involved porting an existing CUDA implementation. This porting effort has shed some light on several key differences between programming models related to design heuristics, compiler-runtime implementations and user code maintenance. Moreover, we would also like to discuss our results comparing performance metrics from SYCL with those of native CUDA and HIP implementations on their respective supported hardware. Given the portable nature of SYCL, we will discuss our results from evaluating existing SYCL implementations with our in-house radiation model to understand SYCL’s performance and portability.||N/A|
|4:30pm BST||A SYCL Extension for User-Driven Online Kernel Fusion||Víctor Pérez-Carrasco, Lukas Sommer, Victor Lomüller, Kumudha Narasimhan and Mehdi Goli||Heterogeneous programming models such as SYCL are becoming increasingly popular, as they allow a wide variety of specialized accelerators found in today’s heterogeneous systems to be integrated into an application with ease. By offloading specific tasks of the application to specialized accelerators, these programming models can achieve portable performance. While this approach can deliver significant improvements in application performance in many cases, short-running device kernels remain a challenge for most heterogeneous programming models. The overhead caused by the necessary data transfers, synchronization between host and device, the kernel launch itself and the loss of locality leads to noticeable performance losses when launching many small device kernels. Such performance degradation can be significant with repeated usage of small kernels in applications with graph-based algorithms, which are the backbone of machine learning frameworks. The different operators in neural network models are typically mapped to a device kernel, and for memory-bound operators like ReLU, this leads to the launch of many small kernels. One potential solution to address this problem is to merge several of these small, memory-bound, short-running kernels into a single larger kernel. This leads to better use of the device’s resources and amortizes the device launch overhead. Hence, this kind of operation fusion is a common optimization in many domains, including machine learning frameworks such as ONNX Runtime and PyTorch. However, given the huge set of potential combinations, manually creating fused versions of kernels is a time-consuming and error-prone task.
This can push programmers to seek a trade-off between (a) fused, task-specific kernels, which are hard to maintain, or (b) a set of smaller, modular kernels, which are easier to maintain, but whose launch carries additional overhead. To address this problem, this work proposes an extension to the SYCL API for user-driven, online kernel fusion. The proposed extension gives users or software frameworks (e.g., neural network inference engines) using SYCL a way to automatically fuse multiple SYCL device kernels at runtime, without the need for manual implementation of the fused version of the kernel. Users or software frameworks can use their application and domain knowledge, as well as runtime context information, to determine when fusion of kernels is legal and profitable, while the actual process of creating a fused kernel is automated by the SYCL runtime.||Download|
|5:00pm BST||Acceleration of sparse matrix linear solvers for circuit simulation||Danial Chitnis||Transistor-level circuit simulation is an essential part of integrated circuit design, especially when optimising high speed and low power design is needed. A circuit simulation consists of multiple components, including netlist parsing, device evaluation, and linear solvers. The non-linear elements, such as transistors in each circuit, are substituted with linear models and solved using conventional linear systems, which consume most of the computation resources in a circuit simulation. Hence, accelerating the linear solver has a direct impact on reducing simulation times. In order to solve these linear systems, a matrix is created based on the conductance of the components within the circuit. The size of this matrix is equal to the number of nodes within the circuit. The conductances between nodes form the matrix elements. Therefore, the resulting matrix is usually square, symmetric, and sparse because the connections between nodes are mainly localised. Since a typical circuit consists of many thousands to millions of transistors, each containing dozens of nodes, the equivalent conductance matrix is large and sparse. These large matrices are solved using LU decomposition, a direct method for solving linear systems by factorising them into a lower and an upper triangular matrix. Over the years, various sparse LU solvers have been proposed. These include SuperLU, UMFPACK, Pardiso (part of oneMKL), and KLU. Among these solvers, KLU is explicitly optimised for circuit-based matrices, which tend to have dense rows. The computational complexity of KLU is strongly dependent on the dimension of the matrix and the pattern of non-zero elements within the matrix. Hence, the size and topology of the circuit could significantly impact the simulation time.
These solving algorithms have been traditionally hard to parallelise due to dependent nested for-loops and indirect indexing within their core algorithm. Hence, circuit simulation has had a limited benefit from high-performance computing, including multi-CPU configurations and GPU accelerators. In this presentation, we propose a new paradigm in circuit simulation which takes advantage of parallelisation and vectorisation in modern heterogeneous hardware to reduce the overall simulation time and increase productivity in integrated circuit design. This paradigm is based on the inherent parallelism within the workflow of integrated circuit design. These workflows include parametric sweeps in DC simulations, convergence in transient simulations, and Monte Carlo analysis to investigate the effects of device variations. We will demonstrate results from our modified KLU algorithm that takes advantage of multi-threading and vectorisation in modern CPUs using the ISO C++ language. We will present benchmarking results of our modified KLU on CPU versus the original algorithm across a wide range of circuit matrices, demonstrating an overall acceleration in the total solve time of up to 10x depending on the non-zero pattern and size of the matrix. We discuss the effects of multi-threading and vectorisation in our CPU implementation, including the roofline model and the optimum resource configuration to achieve the shortest solve time. Additionally, we will present results from implementing our modified KLU solver on an FPGA accelerator card using High-Level Synthesis (HLS). These include benchmarking results and the effect of FPGA-specific optimisation techniques, including pipelining, unrolling, and array partitioning. Our results demonstrate that array partitioning has the most impact among these optimisation techniques, with up to 8x acceleration on selected matrices.
We will discuss the challenges of implementing the KLU solver on CPU and FPGA for small and large circuit matrices to achieve acceleration and lower power consumption. We believe that our new paradigm in circuit simulation provides a novel perspective on accelerating physics-based simulations in general, which have traditionally been challenging to parallelise. This perspective paves the way for better optimisation techniques and intelligent autonomous chip design for next-generation electronic and CAD software.||Download|
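The kernel fusion idea described in the 4:30pm abstract can be illustrated with a minimal host-side sketch in plain C++. This is not the proposed SYCL extension and uses no SYCL API. The function names `scale_then_relu_unfused` and `scale_then_relu_fused` are hypothetical, and the fusion here is written by hand purely to show the effect that the extension aims to automate at runtime: two small memory-bound passes become a single pass, so the intermediate result is never written to or re-read from memory.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Unfused version: two separate "kernels" (illustrative host loops).
// Each makes a full pass over memory, and the intermediate vector
// is written by the first pass and read again by the second.
std::vector<float> scale_then_relu_unfused(const std::vector<float>& in, float factor) {
    std::vector<float> tmp(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {   // kernel 1: scale
        tmp[i] = in[i] * factor;
    }
    std::vector<float> out(tmp.size());
    for (std::size_t i = 0; i < tmp.size(); ++i) {  // kernel 2: ReLU
        out[i] = std::max(tmp[i], 0.0f);
    }
    return out;
}

// Fused version: a single pass in which the intermediate value
// stays in a local variable instead of a temporary buffer.
std::vector<float> scale_then_relu_fused(const std::vector<float>& in, float factor) {
    std::vector<float> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        const float scaled = in[i] * factor;        // former kernel 1
        out[i] = std::max(scaled, 0.0f);            // former kernel 2
    }
    return out;
}
```

Both functions return the same values for the same input. On a real accelerator the fused form would additionally save one kernel launch and the associated synchronization, which is the overhead the abstract refers to.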
Many C++ programming models exist, including SYCL™, HPX, Kokkos, Alpaka and RAJA. This workshop aims to address the needs of both HPC and the consumer/embedded/datacentre community, where several C++ parallel programming frameworks have been developed to address the needs of multi-threaded and distributed applications. The C++11/14/17/20 International Standards have introduced new tools for parallel programming to the language, and the ongoing standardization effort is developing additional features which will enable support for heterogeneous and distributed parallelism. Additional standards have built on top of these ISO languages for heterogeneous and distributed programming. This conference is an ideal place to discuss research in this domain, consolidate usage experience, and share new directions to support new hardware and memory models with the aim of passing that experience to ISO C++. These programming models will enable future support for pre-Exascale, Exascale, and Zettascale computing.
- Ruth Falconer, University of Abertay, UK
- Aksel Alpay, University of Heidelberg, Germany
- David Bernholdt, ORNL, US
- James Brodman, Intel, US
- Danial Chitnis, University of Edinburgh, UK
- Biagio Cosenza, University of Salerno, Italy
- Andrey Alekseenko, KTH, Sweden
- Mehdi Goli, Codeplay Software, UK
- Kevin Harms, Argonne National Laboratory, US
- Ronan Keryell, AMD, US
- Erik Lindahl, Stockholm University, Sweden
- Axel Naumann, CERN, Switzerland
- Vincent Pascuzzi, Brookhaven National Laboratory, US
- Ruyman Reyes, Codeplay Software, UK
- Ricardo Sanchez Schulz, University of Concepción, Chile
- Rod Burns, Codeplay Software, UK
- Jose Cano, University of Glasgow, UK
- Michael Wong, Codeplay Software, UK
- Garth Wells, University of Cambridge, UK
Full Academic Papers
Please submit full papers for review, which should be no longer than 12 pages using the Springer LNCS style.
For authors of accepted papers:
- The full paper will be included in the proceedings published in the Springer Library.
- Authors must submit the final camera-ready paper using the template by the deadline sent by the organizers.
- The video presentation and slides will be published on the workshop web page.
Submissions require a 500-1000 word abstract. Additional supporting materials may be submitted as a single pdf document.