Portable Heterogeneous Programming with SYCL (PHPS22)
The PHPS22 workshop is co-located with ISC22 in Hamburg, Germany. To attend you must register for a workshop pass at ISC.
This workshop focuses on the experience of programming current and next-generation heterogeneous systems. It is a platform for HPC/AI code developers, users, and providers of systems and tools to come together and share their experiences, successes, and challenges in using the Khronos open-standard SYCL™ interface for programming heterogeneous systems. The workshop will feature a mix of peer-reviewed talks, lightning talks, and an invited keynote. Over this full day, attendees of the “Portable Heterogeneous Programming with SYCL” workshop will share their work to build a deeper understanding of vendor-neutral, high-performance heterogeneous programming through talks and demonstrations using SYCL and related tools, and through interaction with academic and industry experts from Europe and around the world.
Within the last decade, heterogeneity has become the norm, driven by performance and energy-efficiency requirements. Beyond offload devices such as GPGPUs and FPGAs, new specialized accelerators for machine learning and deep learning, and even IPUs and DPUs, have emerged, further increasing this heterogeneity. There is also a clear mandate for co-design in hardware and for open-standard programming models in HPC. Given these emerging issues, it is imperative that the HPC community works together to enhance developer productivity by standardizing programming models and tools. This workshop will provide a forum for working through such critical issues, developing solutions collaboratively across multiple vendors, the DOE, and the wider HPC community using Khronos SYCL, a heterogeneous programming model for HPC based on ISO C++. SYCL is an open, standards-based unified programming model from the Khronos Group that delivers a common developer experience across accelerator architectures, for faster application performance, more productivity, and greater innovation.
SYCL is being adopted as part of the Exascale Computing Project to enable code portability across various pre-exascale and exascale supercomputers. There are multiple implementations of SYCL, including Beiming, ComputeCpp™, DPC++, hipSYCL, neoSYCL and triSYCL. These implementations provide a range of support for different vendor platforms, including AMD, ARM®, Huawei®, Intel®, NEC, NVIDIA® and Renesas®. In addition, the oneAPI specification uses SYCL at its heart to enable multi-target support for a range of libraries alongside DPC++. As a result, HPCwire, the leading publication for news and information in the high-performance computing industry, awarded oneAPI the 2021 HPCwire Readers’ Choice Award at the recently concluded Supercomputing Conference (SC21).
Date: Thursday 2nd June 2022
| Time | Title | Speaker | Abstract | Slides |
| --- | --- | --- | --- | --- |
| 14:00 CET | Keynote: Dynamic task fusion with SYCL for an explicit hyperbolic equation system solver with dynamic AMR and local time stepping | Tobias Weinzierl, Durham University | Classic task parallelism is not a direct fit for SYCL (or GPUs in general), as tasks are often chosen to be tiny, without further nested parallelism. However, if we deploy a task to a GPU, we expect it to exploit the accelerator’s hardware concurrency and to be reasonably large. Within the ExaHyPE project, we work with tiny tasks. However, we do not deploy all tasks directly into the tasking runtime. Instead, we buffer them in application-specific queues. If many appropriate tasks “assemble” within this queue and we identify that they have no side effects, i.e. do not alter the global simulator state, we merge them into one large meta-task and deploy this meta-task to one specialised compute kernel, which can batch the computations over multiple patches, vectorise aggressively, and even deploy whole task assemblies to an accelerator. To support multiple vendors and to allow for a smooth transition into the oneAPI era, we implement this task-merging layer on top of OpenMP, TBB and C++ threading, and design the actual compute kernels such that they are interoperable with all three “backends”. | Download |
| 14:30 CET | Extreme-Scale Scientific Software Stack | Sameer Shende, University of Oregon | The DOE Exascale Computing Project (ECP) Software Technology focus area is developing an HPC software ecosystem that will enable the efficient and performant execution of exascale applications. Through the Extreme-scale Scientific Software Stack (E4S) [https://e4s.io], it is developing a comprehensive and coherent software stack that will enable application developers to productively write highly parallel applications that can portably target diverse exascale architectures. E4S provides both source builds through the Spack platform and a set of containers that feature a broad collection of HPC and AI/ML software packages that target GPUs. E4S exists to accelerate the development, deployment, and use of HPC software, lowering the barriers for HPC and AI/ML users. It provides container images, build manifests, and turn-key, from-source builds of popular HPC software packages developed as Software Development Kits (SDKs). This effort covers a broad range of areas, including programming models and runtimes (MPICH, Kokkos, RAJA, OpenMPI), development tools (TAU, HPCToolkit, PAPI), math libraries (PETSc, Trilinos), data and visualization tools (Adios, HDF5, Paraview), and compilers (LLVM), all available through the Spack package manager. The talk will highlight tools such as the TAU Performance System® [http://tau.uoregon.edu] and its support of Intel’s oneAPI, including instrumentation of applications using DPC++/SYCL on Intel GPUs and CPUs. E4S includes oneAPI and features base and full-featured Docker and Singularity container images, with support for cloud platforms including AWS. | Download |
| 15:00 CET | Address Space Inference using ComputeCpp | Peter Zuzek, Codeplay Software | The SYCL specification defines pointer address spaces, a way to represent disjoint memory regions for specific hardware. SYCL 2020 changes the address space deduction rules to the point where address spaces are no longer explicitly tied to C++ types. One example is the introduction of Unified Shared Memory pointers, where raw pointers can be used in kernels; another is the introduction of the generic address space, which moves the deduction burden further away from the user. To tackle the required changes, Codeplay have been developing a new Address Space Inference system using an updated LLVM compiler that ships inside the ComputeCpp package. The new system supports the scenarios above but also introduces a few changes compared to the previous rules. | Download |
| 15:30 CET | Toward performance-portable matrix-free solvers using SYCL | Igor Baratta, University of Cambridge | It has proved challenging in the past to achieve good performance for finite element methods on GPUs and accelerators. In part, this was due to the use of traditional sparse matrix structures to represent differential operators, which leads to algorithms and data structures with irregular memory accesses and which are severely memory bandwidth limited. To overcome this limitation, high-order matrix-free approaches are being actively researched for use on GPUs. In this talk, we discuss and demonstrate how performance-portable matrix-free solvers can be developed in SYCL and how the algorithms can be adapted to match the performance characteristics of the target hardware. We will also discuss how carefully arranging memory transfers and allocations can reduce latency and increase throughput on different accelerators, especially in the context of multi-GPU nodes. Finally, we will present performance results for different GPU architectures, including NVIDIA A100 and AMD Instinct™ MI100 accelerators. | Download |
| 16:30 CET | The Cycle of Improving SYCL: Exploring DPC++ extensions with an eye to help future SYCL be even better | Igor Vorobtsov, Intel | SYCL is a Khronos standard that brings support for fully heterogeneous data parallelism to C++ and improves programming productivity on multiple hardware accelerators. SYCL 2020 is the newest release of the SYCL specification, representing a major step forward and featuring over 40 new additions and improvements. Intel engineers, in collaboration with the community and other vendors, worked hard to contribute to the success of SYCL 2020. This work is going further, and many new features are being implemented within the Intel oneAPI DPC++/C++ Compiler to make future improvements to SYCL possible. We will discuss key extensions that help simplify heterogeneous programming with SYCL and make it even more efficient, e.g. Enqueued Barrier, Platform Default Contexts, Discard Queue Events, Sub-group Load and Store, Local Memory, etc. | - |
| 16:50 CET | Targeting HPC accelerators in GROMACS using SYCL for performance and portability | Andrey Alekseenko, KTH (co-author Szilárd Páll, KTH) | This talk will discuss our experiences with adopting SYCL in GROMACS for portability as well as production-level performance. SYCL will be the primary API used to target Intel and AMD heterogeneous platforms, but it will also replace OpenCL as the GPU portability layer, with the goal of supporting a broad range of current and future accelerator platforms. Currently, GROMACS SYCL support extends to all three major GPU vendors (AMD, Intel, NVIDIA), and we will present detailed results using multiple implementations, including oneAPI DPC++ and hipSYCL. Finally, we will discuss the future needs and requirements that the molecular simulation use case, and specifically GROMACS, poses on the SYCL standard and its implementations. | Download |
| 17:10 CET | Panel | PHPS22 Speakers | SYCL Working Group Chair Michael Wong will host a panel discussion with the PHPS22 presenters; bring your questions to this interactive session. | - |
- Rod Burns, Codeplay®, United Kingdom
- Aleksander Ilic, University of Lisbon, Portugal
- Raja Appuswamy, Eurecom, France
- Michael Wong, SYCL Working Group Chair, Codeplay
- James Reinders, Intel Corp
- Kevin Harms, Argonne National Laboratory
- Tom Deakin, University of Bristol
- Ruyman Reyes, Codeplay Software
- Garth Wells, University of Cambridge
- Erik Lindahl, KTH Royal Institute of Technology
- Thomas Steinke, Zuse Institute Berlin
- Peter Zuzek, Codeplay Software
- Clayton Hughes, Sandia National Laboratories