Research Papers & Benchmarks

If you are considering SYCL as a programming model for heterogeneous and parallel software development, a wide variety of independent research papers has been published. This page links to details for some of them.

Research Papers

  • Fast Merge Tree Computation via SYCL

    Authors: Arnur Nigmetov, Dmitriy Morozov

    A merge tree is a topological descriptor of a real-valued function. Merge trees are used in visualization and topological data analysis, either directly or as a means to another end: computing a 0-dimensional persistence diagram, identifying connected components, performing topological simplification, etc. Scientific computing relies more and more on GPUs to achieve fast, scalable computation. For efficiency, data analysis should...

    View Paper
  • Comparing SYCL™ Data Transfer Strategies for Tracking Use Cases

    Authors: S Joube, H Grasland, D Chamont and E Brunet

    The aim of this work is to compare the performance and ease of programming of the various data transfer strategies provided by SYCL 2020: buffers/accessors on one hand and the different storage types exposed by Unified Shared Memory (USM) on the other hand. We measured the relative performance of USM exclusively located either on the host (USM host) or on...

    View Paper
  • Evaluation of Intel's DPC++ Compatibility Tool in heterogeneous computing

    Authors: German Castano, Youssef Faqir-Rhazoui, Carlos Garcia, Manuel Prieto-Matias

    "DPCT greatly streamlines the migration process from CUDA to oneAPI. Twenty out of the twenty three benchmarks were successfully migrated without major developer interventions. Memory operations (device memory management operations and data transfers between host and device memories) take roughly the same time in the migrated and native codes. While some migrated applications achieved similar performance to the original CUDA...

    View Paper
  • SYCL Code Generation for Multigrid Methods

    Authors: Stefan Groth, Christian Schmitt, Jürgen Teich, and Frank Hannig

    Multigrid methods are fast and scalable numerical solvers for partial differential equations (PDEs) that possess a large design space for implementing their algorithmic components. Code generation approaches allow formulating multigrid methods on a higher level of abstraction that can then be used to define a problem and hardware-specific solution. Since these problems have considerable implementation variability, it is crucial to define...

    View Paper
  • Performance Portability of Multi-Material Kernels

    Authors: Istvan Z. Reguly

    Trying to improve performance, portability, and productivity of an application presents non-trivial trade-offs, which are often difficult to quantify. Recent work has developed metrics for performance portability, as well as some aspects of productivity - in this case study, we present a set of challenging computational kernels and their implementations from...

    View Paper
  • Performance portability of a Wilson Dslash Stencil Operator Mini-App using Kokkos and SYCL

    Authors: Balint Joo, Thorsten Kurth, M. A. Clark, Jeongnim Kim, Christian R. Trott, Dan Ibanez, Dan Sunderland, Jack Deslippe

    We describe our experiences in creating mini-apps for the Wilson-Dslash stencil operator for Lattice Quantum Chromodynamics using the Kokkos and SYCL programming models. In particular we comment on the performance achieved on a variety of hardware architectures, limitations we have reached in both programming models and how these have been resolved by us, or may be resolved by the...

    View Paper
  • Innovative language extensions for accelerator cards using the example of SYCL, HC, HIP and CUDA: research on usability and performance

    Authors: Jan Stephan, Dr. Wolfgang E. Nagel

    Translated from German: “The purpose of this work is a comparative analysis of the programming models CUDA, SYCL and ROCm (or HC and HIP) on GPUs of the manufacturers NVIDIA and AMD. On the one hand, the capabilities and concepts underlying the respective models are to be compared, on the other hand the concrete achievable performance is to be determined...

    View Paper
  • Celerity: High-level C++ for Accelerator Clusters

    Authors: Peter Thoman, Philip Salzmann, Biagio Cosenza, and Thomas Fahringer

    In the face of ever-slowing single-thread performance growth for CPUs, the scientific and engineering communities increasingly turn to accelerator parallelization to tackle growing application workloads. Existing means of targeting distributed memory accelerator clusters impose severe programmability barriers and maintenance burdens. The Celerity programming environment seeks to enable developers to scale C++ applications to accelerator clusters with relative ease, while leveraging and extending the SYCL domain-specific...

    View Paper
  • Improving the Performance of Medical Imaging Applications using SYCL

    Authors: Zheming Jin

    In this report, we are interested in applying the SYCL programming model to medical imaging applications for a study on performance portability and programming productivity. The SYCL standard specifies a cross-platform abstraction layer that enables programming of heterogeneous computing systems using standard C++. As opposed to the Open Computing Language (OpenCL) programming model, in which host and device code are...

    View Paper

Benchmarks

  • RSBench

    RSBench is a mini-app representing a key computational kernel of the Monte Carlo neutron transport algorithm.

    View Benchmarks
  • ParResKernels

    Parallel Research Kernels is a suite that contains a number of kernel operations, plus a simple build system intended for a Linux-compatible environment. Most of the code relies on open standard programming models including SYCL and thus can be executed on many computing systems.

    View Benchmarks
  • BabelStream

    BabelStream is a benchmark used to measure the memory transfer rates to/from capacity memory. Unlike other memory bandwidth benchmarks, this does not include any PCIe transfer time for attached devices.

    View Benchmarks