**Aurora Early Adopters Series** ### Overview of the Intel® oneAPI Math Kernel Library Peter Caday ## What's in Intel® oneMKL (Beta)? **LINEAR ALGEBRA VECTOR RNGS VECTOR MATH SUMMARY AND MORE FFT STATISTICS** Trigonometric BLAS Kurtosis **Splines** Multidimensional Engines Hyperbolic Variation **LAPACK** coefficient Exponential **ScaLAPACK** Order statistics Interpolation Logarithmic Min/Max Sparse BLAS Power Cluster FFT Distributions **Fast Poisson** Variance-Solver Sparse solvers Root covariance DPC++/C/C++/Fortran API DPC++ API with GPU C/C++/Fortran CPU with GPU support support only support ## Zoom in: Dense Linear Algebra + FFT LAPACK BLAS FFT Level 1 LU/QR Complex (vector ops) 1D/2D/3D DPC++/OpenMP offload Cholesky Level 2 with GPU support (matrix-vector) Real-to-complex Eigensolvers DPC++ API Level 3 Complex-to-real with GPU support (matrix-matrix) 1D/2D/3D Batch factorizations Level 3 CPU support only extensions **Utility routines** Cluster FFT Batched ScaLAPACK ## Potential GPU Usage Models **Example**: multiply double-precision matrices $C \leftarrow AB$ - On host (A/B/C in host memory) - Automatic offload to GPU (A/B/C on h)st - Operats Pot ff Place (\*) GP/U transfer overhead! - Manual offload (A/B/C on host or device) - Inside device kernel (A/B/C on device) ``` dgemm(..., A, ..., B, ..., C, ...); dgemm(..., A, ..., B, ..., C, ...); #pragma omp target variant dispatch ... dgemm(..., A, ..., B, ..., C, ...); dgemm(device_queue, ..., A, ..., B, ..., C, ...); void my_kernel(...) { dgemm(..., A, ..., B, ..., C, ...); } ``` ## oneMKL GPU Usage Models | | Automatic offload | OpenMP offload | Manual offload | Device API | |-----------------|-------------------|-------------------------------|------------------------------|--------------------------------| | Invocation side | CPU | | GPU | | | Data location | CPU | GPU / CPU | GPU / CPU / shared | GPU | | Interface | C/C++/Fortran | C/C++/Fortran +<br>OpenMP | DPC++ | DPC++ | | EOY support | None | Most oneMKL GPU functionality | All oneMKL GPU functionality | Limited support (selected RNG) | | Less Control | More Control | | | |--------------------|-------------------|--|--| | | | | | | Fewer Code Changes | More Code Changes | | | ## Using oneMKL OpenMP Offload Interfaces ## Offload: Key OpenMP Directives (C) ``` #pragma omp target data ``` Map host-side variables to device variables inside this block. ``` #pragma omp target enter data #pragma omp target exit data ``` Map/unmap host-side variables to device variables: the two halves of #pragma omp target data. ``` #pragma omp target ``` Offload execution of block to the GPU. ``` #pragma omp target variant dispatch ``` Offload oneMKL calls inside this block to the GPU. #### **GEMM** with oneMKL C API ``` int main() { long m = 10, n = 6, k = 8, lda = 12, ldb = 8, ldc = 10; long sizea = lda * k, sizeb = ldb * n, sizec = ldc * n; double alpha = 1.0, beta = 0.0; // Allocate matrices double *A = (double *) mkl malloc(sizeof(double) * sizea, 64); double *B = (double *) mkl malloc(sizeof(double) * sizeb, 64); double *C = (double *) mkl malloc(sizeof(double) * sizec, 64); // Initialize matrices [...] // Compute C = A * B on CPU cblas dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, m, n, k, alpha, A, 1da, B, 1db, beta, C, 1dc); ``` $C \leftarrow \alpha AB + \beta C$ ## GEMM with oneMKL C OpenMP Offload ``` int main() { long m = 10, n = 6, k = 8, lda = 12, ldb = 8, ldc = 10; long sizea = lda * k, sizeb = ldb * n, sizec = ldc * n; double alpha = 1.0, beta = 0.0; // Allocate matrices double *A = (double *) mkl malloc(sizeof(double) * sizea, 64); double *B = (double *) mkl malloc(sizeof(double) * sizeb, 64); double *C = (double *) mkl malloc(sizeof(double) * sizec, 64); // Initialize matrices [...] #pragma omp target data map(to:A[0:sizea],B[0:sizeb]) map(tofrom:C[0:sizec]) #pragma omp target variant dispatch use_device_ptr(A, B, C) nowait // Compute C = A * B on GPU cblas dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,\m, n, k, alpha, A, lda, B, ldb, beta, C, ldc); Optional nowait clause for asynchronous execution ``` Use #pragma omp taskwait for synchronization $C \leftarrow \alpha AB + \beta C$ Use target data map to send matrices to the device Use target variant dispatch to request GPU execution for cblas\_dgemm List mapped device pointers in the use\_device\_ptr clause ## GEMM with oneMKL Fortran OpenMP Offload ``` ... // module files include "mkl omp offload.f90" ◆ Module for Fortran OpenMP offload program main use mkl blas omp offload integer :: m = 10, n = 6, k = 8, lda = 12, ldb = 8, ldc = 10 integer :: sizea = lda * k, sizeb = ldb * n, sizec = ldc * n Use target data map to double :: alpha = 1.0, beta = 0.0 double, allocatable :: A(:), B(:), C(:) send matrices to the device // Allocate matrices... allocate(A(sizea)) Use target variant dispatch to request GPU execution for dgemm // Initialize matrices... !$omp target data map(to:A(1:sizea), B(1:sizeb)) map(tofrom:C(1:sizec)) !$omp target variant dispatch use_device_ptr(A, B, C) nowait List mapped device pointers in ! Compute C = A * B on GPU the use_device_ptr clause call dgemm('N', 'N', m, n, k, alpha, A, lda, B, ldb, beta, C, ldc) !$omp end target variant dispatch !$omp end target data Optional nowait clause for asynchronous execution Use !$omp taskwait for synchronization end program ``` ## Using oneMKL DPC++ Interfaces ## Data Parallel C++ (DPC++) Introduction - **SYCL** is a C++-based, single-source programming language for heterogeneous computing. - DPC++ is SYCL + many new extensions. - e.g. pointer-based programming (Unified Shared Memory) - Open, standards-based, multi-vendor. https://software.intel.com/en-us/oneapi ## DPC++: Key SYCL Constructs for oneMKL Usage ``` sycl::queue Q{sycl::cpu_selector{}}; sycl::queue Q{sycl::gpu_selector{}}; sycl::queue Q{device}; ``` Create device queue attached to a given device or device type. All device execution goes through a queue object. ``` void *mem = sycl::malloc_shared(bytes, Q); void *mem = sycl::malloc_device(bytes, Q); ``` Allocate device-accessible memory. malloc\_shared memory is also accessible from the host. ``` sycl::buffer<T,1> mem(elements); sycl::buffer<T,1> mem(elements, hostptr); ``` Smart buffer object. Migrates memory automatically and tracks data dependencies. Can be attached to host memory (synchronized at creation and destruction). #### GEMM with one MKL CAPI ``` int main() { int64 t m = 10, n = 6, k = 8, lda = 12, ldb = 8, ldc = 10; int64 t sizea = lda * k, sizeb = ldb * n, sizec = ldc * n; double alpha = 1.0, beta = 0.0; // Allocate matrices double *A = (double *) mkl malloc(sizeof(double) * sizea); double *B = (double *) mkl malloc(sizeof(double) * sizeb); double *C = (double *) mkl malloc(sizeof(double) * sizec); // Initialize matrices [...] cblas dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, m, n, k, alpha, A, 1da, B, 1db, beta, C, 1dc); ``` $C \leftarrow \alpha AB + \beta C$ #### GEMM with oneMKL DPC++ ``` int main() { using namespace oneapi::mkl; int64 t m = 10, n = 6, k = 8, lda = 12, ldb = 8, ldc = 10; int64 t sizea = lda * k, sizeb = ldb * n, sizec = ldc * n; double alpha = 1.0, beta = 0.0; sycl::queue Q{sycl::gpu selector{}}; // Allocate matrices double *A = malloc_shared<double>(sizea, Q); __ double *B = malloc shared<double>(sizeb, 0); double *C = malloc shared<double>(sizec, 0); // Initialize matrices [...] auto e = blas::gemm(Q, transpose::N, transpose::N, m, n, k, _ alpha, A, lda, B, ldb, beta, C, ldc); Output e is a sycl::event object representing command completion Call e.wait() to wait for completion ``` $C \leftarrow \alpha AB + \beta C$ Set up GPU queue Allocate CPU/GPU-accessible shared memory New oneMKL DPC++ API Computation is performed on given queue ## Batch Computations in oneMKL ## **Batching Overview** - Execute multiple **independent** operations of the same type in a single call - e.g. invert 100 different 8x8 matrices - Benefits: increased parallelism, reduced overhead - Available APIs: - BLAS: gemm, trsm, axpy - LAPACK\*: LU (getrf, getri, getrs), Cholesky (potrf, potrs), QR (geqrf, orgqr, ungqr) - FFT: all DFTs ## BLAS/LAPACK Group APIs **Group** = set of operations with identical parameters (size, transpose...) but different matrix/vector data Group batch APIs process one or more groups simultaneously. ## BLAS/LAPACK Group APIs **Group** = set of operations with identical parameters (size, transpose...) but different matrix/vector data Group batch APIs process one or more groups simultaneously. #### Examples: - n operations, 1 group: all parameters identical - *n* operations, *n* groups: each operation has different parameters ## Example: Batch DGEMM cblas\_dgemm\_batch(CblasColMajor, transA, transB, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc, num\_groups, group\_sizes); group\_size[1] group\_size[0] a • • • m[0] ... • • • k[0] k[1] m[0] m[1]1 array entry per group k[0] k[1] 1 array entry per GEMM operation ## BLAS/LAPACK Strided DPC++ APIs - DPC++ oneMKL adds strided APIs for simple batch cases - Single group: all matrix/vector sizes, parameters are homogeneous - Fixed stride between successive matrices/vectors in batch - Base address + stride replaces array of pointers - Strides on inputs may be zero to reuse an input for all operations in the batch ## Strided Batch Snippet - LU ``` #include "oneapi/mkl.hpp" using namespace oneapi::mkl; int64 t batch = 100; // 100 matrices int64 t N = 10; // Matrix size - square in this example int64 t stride = N * N; // 10x10 matrices are contiguous in memory int64 t stride piv = N; // Pivot entries also contiguous in memory sycl::queue Q{sycl::gpu selector{}}; // Allocate memory for matrices and pivot indices, as well as scratch space. auto A array = sycl::malloc shared<double>(stride * batch, Q); auto pivot array = sycl::malloc shared<double>(stride piv * batch, 0); auto scratch_size = lapack::getrf_batch_scratchpad_size(Q, n, n, n, stride, stride_piv, batch); = sycl::malloc shared<double>(scratch size, Q); auto scratch // [Initialize A array here] // Batch computation lapack::getrf batch(Q, n, n, A array, n, stride, pivot array, stride piv, scratch, scratch size) .wait(); ``` ## oneMKL Resources ## Resources | Websites | https://software.intel.com/en-us/intel-mkl | | |---------------------|-----------------------------------------------------------------------|--| | | https://software.intel.com/en-us/oneapi/onemkl | | | Forum | https://software.intel.com/en-us/forums/intel-math-kernel-library | | | Developer Reference | https://software.intel.com/en-us/oneapi-mkl-dpcpp-developer-reference | | | Link Line Advisor | http://software.intel.com/en-us/articles/intel-mkl-link-line-advisor | | | Benchmarks | https://software.intel.com/en-us/intel-mkl/benchmarks | | #### **Notices & Disclaimers** This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer. The benchmark results reported herein may need to be revised as additional testing is conducted. The results depend on the specific platform configurations and workloads utilized in the testing, and may not be applicable to any particular user's components, computer system or workloads. The results are not necessarily representative of other benchmarks and other benchmark results may show greater or lesser impact from mitigations. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit <a href="https://www.intel.com/benchmarks">www.intel.com/benchmarks</a>. INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Copyright © 2020, Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Core, VTune, and OpenVINO are trademarks of Intel Corporation or its subsidiaries in the U.S. and other countries. Khronos® is a registered trademark and SYCL is a trademark of the Khronos Group, Inc. #### **Optimization Notice** Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804 ## **Questions & Answers** #