“Execution of independent SYCL commands may overlap” is an optimization that SYCL application developers would like to rely on. By executing commands concurrently, developers hope that their code will run faster. This poster uses this empirical metric to assess whether a computing environment lives up to developers’ expectations. We run each individual command serially to generate a baseline, and then check whether the same commands run faster when scheduled in a way that allows concurrency. The SYCL specification allows concurrent execution of independent commands when they are scheduled in an out-of-order queue, or when they are scheduled to multiple, possibly in-order, queues. We tested four kinds of independent commands, in both the “multiple in-order queues” and “single out-of-order queue” modes:

- Two compute kernels, each with low occupancy
- One compute kernel and one memory copy from system-allocated memory to a device buffer (M2D)
- One compute kernel and one memory copy from a device buffer to system-allocated memory (D2M)
- One M2D and one D2M

The poster’s contribution is twofold. Firstly, the source code used for these experiments has been made open source (https://github.com/argonne-lcf/HPC-Patterns/tree/main/concurency) so that others can evaluate these different approaches to concurrency. Our code uses USM for the memory transfers and relies on a clpeak-like kernel for the compute part (https://github.com/krrishnarraj/clpeak/blob/master/src/kernels/compute_dp_kernels.cl). The memory buffers used are as large as USM allocation allows (`sycl::info::device::max_mem_alloc_size`) to minimize runtime overhead relative to execution time. The number of FMA operations in the compute kernel is chosen so that the execution times of the compute kernel and the data transfers are similar.
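The measurement methodology above can be sketched in SYCL as follows. This is a minimal illustration, not the poster's actual benchmark: the kernel body, iteration counts, and variable names are assumptions, and a real run would repeat and average the timings. It times two low-occupancy, FMA-heavy kernels back to back (the serial baseline), then submits them to two in-order queues on the same device, where the SYCL specification permits them to overlap.

```cpp
#include <sycl/sycl.hpp>
#include <chrono>
#include <iostream>

int main() {
  sycl::device dev{sycl::default_selector_v};
  // Two in-order queues on the same device: commands submitted to different
  // queues are independent and MAY run concurrently. The other mode tested by
  // the poster would instead use a single queue constructed without the
  // in_order property (SYCL queues are out-of-order by default).
  sycl::queue q1{dev, sycl::property::queue::in_order{}};
  sycl::queue q2{dev, sycl::property::queue::in_order{}};

  constexpr size_t N = 1 << 16;  // small range -> low occupancy (illustrative)
  double *a = sycl::malloc_device<double>(N, q1);
  double *b = sycl::malloc_device<double>(N, q2);

  // FMA-heavy kernel in the spirit of clpeak's compute_dp kernels.
  auto busy_kernel = [=](sycl::queue &q, double *buf) {
    return q.parallel_for(sycl::range<1>{N}, [=](sycl::id<1> i) {
      double x = 1.0;
      for (int k = 0; k < 4096; ++k) x = sycl::fma(x, 1.000001, 0.5);
      buf[i] = x;
    });
  };

  auto time_it = [&](auto &&launch) {
    auto t0 = std::chrono::steady_clock::now();
    launch();
    q1.wait();
    q2.wait();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
  };

  // Baseline: run the two kernels serially on one queue.
  double serial = time_it([&] {
    busy_kernel(q1, a).wait();
    busy_kernel(q1, b).wait();
  });
  // Candidate: submit to two in-order queues with no intermediate wait.
  double concurrent = time_it([&] {
    busy_kernel(q1, a);
    busy_kernel(q2, b);
  });
  // The M2D / D2M cases follow the same pattern with USM copies, e.g.
  //   q1.memcpy(device_ptr, host_ptr, bytes);  // M2D on queue 1
  //   q2.memcpy(host_ptr, device_ptr, bytes);  // D2M on queue 2

  std::cout << "serial: " << serial << " s, two queues: " << concurrent
            << " s, speedup: " << serial / concurrent << "\n";

  sycl::free(a, q1);
  sycl::free(b, q2);
}
```

A speedup close to 2x indicates the environment actually overlaps the independent commands; a speedup near 1x indicates they were serialized despite being independent.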
Secondly, we tested multiple SYCL compilers, targeting multiple backends, on multiple hardware platforms (at the time this abstract was written: DPCPP / OpenCL / Gen, DPCPP / L0 / Gen9, DPCPP / CUDA / A100, hipSYCL / HIP / MI100; we plan to measure more). Results are mixed, with some environments achieving concurrency in most tests and others in none. It is also interesting to note that enabling profiling in queues serializes commands in some environments.

Speaker: Thomas Applencourt (Argonne National Laboratory). Co-authors: Abhishek Bagusetty (Argonne National Laboratory) and Aksel Alpay (Heidelberg University).