When using general-purpose graphics processing units (GPUs) to accelerate a numerical method, we often need to reengineer both the algorithm and the implementation strategy. In this talk, I will present two approaches to improving performance.
In the first part of the talk I will discuss Krylov subspace linear system solvers and the optimization of preconditioners, through the development of an algorithm that takes advantage of the single-instruction, multiple-data (SIMD), multilevel threaded parallelism of the GPU.
In the second part of the talk I will discuss code-transformation-based optimization for high-order finite element methods, including optimizing the gradient volume kernel for 3D hexahedral elements and fine-tuning the BP1.0, BP3.5 and BP3.0 benchmark problems (CEED benchmarks). I will introduce empirical roofline models and show a detailed performance analysis of the tuning of the benchmark implementations.
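
As background for the roofline analysis, the basic roofline bound can be sketched in a few lines. This is a minimal illustration, not the model used in the talk: it assumes the standard form in which attainable performance is the smaller of the compute peak and the product of arithmetic intensity and memory bandwidth, and in an empirical roofline both ceilings would be measured on the device rather than taken from the data sheet. The function names and the numbers in the usage note are hypothetical placeholders.

```python
def roofline_gflops(arithmetic_intensity, peak_gflops, bandwidth_gbs):
    """Upper bound on kernel performance in GFLOP/s.

    arithmetic_intensity: floating-point operations per byte moved (FLOP/byte)
    peak_gflops:          compute ceiling in GFLOP/s (assumed measured)
    bandwidth_gbs:        memory bandwidth ceiling in GB/s (assumed measured)
    """
    if arithmetic_intensity < 0 or peak_gflops <= 0 or bandwidth_gbs <= 0:
        raise ValueError("intensity must be non-negative and ceilings positive")
    # A kernel is limited either by memory traffic or by the compute peak.
    return min(peak_gflops, arithmetic_intensity * bandwidth_gbs)


def ridge_point(peak_gflops, bandwidth_gbs):
    """Arithmetic intensity (FLOP/byte) at which the two ceilings meet."""
    return peak_gflops / bandwidth_gbs
```

With placeholder ceilings of 4000 GFLOP/s and 500 GB/s, a kernel at 2 FLOP/byte is bounded by memory bandwidth, while one at 16 FLOP/byte is bounded by the compute peak; the ridge point separates the two regimes. Comparing a tuned kernel's measured rate against this bound shows how much headroom remains.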