This session presents the use of Nsight Compute for analyzing the performance of individual kernels on NVIDIA GPUs. We will walk through some simple kernels, some compute-bound and others memory-bandwidth-bound, and learn how to profile them with Nsight Compute, generate roofline charts, and analyze their performance. We will then introduce a realistic sample kernel from an HPC application and discuss how comprehensive kernel analysis can be used in an iterative process to substantially speed up key application bottlenecks. The goal is for the user to be able to determine whether the performance of a compute construct is “good enough” relative to the capabilities of the hardware and, if not, what steps to take to address it.
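
As a rough illustration of the reasoning behind a roofline chart, the sketch below computes the performance ceiling for a kernel from its arithmetic intensity (floating-point operations per byte moved to or from memory). It is a minimal sketch, not part of the session material: the function names are hypothetical, and the peak throughput and bandwidth figures are made-up round numbers that do not describe any particular GPU. The SAXPY byte count assumes single precision and no cache reuse.

```python
# Minimal roofline-model sketch. All hardware numbers are hypothetical
# placeholders, chosen only to make the arithmetic easy to follow.

def attainable_gflops(arithmetic_intensity, peak_gflops, peak_bandwidth_gbs):
    """Roofline ceiling: the lower of the compute roof and the memory roof."""
    return min(peak_gflops, arithmetic_intensity * peak_bandwidth_gbs)

def classify(arithmetic_intensity, peak_gflops, peak_bandwidth_gbs):
    """Compare the kernel's arithmetic intensity with the ridge point."""
    ridge_point = peak_gflops / peak_bandwidth_gbs  # FLOP/byte where the roofs meet
    if arithmetic_intensity >= ridge_point:
        return "compute-bound"
    return "memory-bound"

# Hypothetical device: 10 TFLOP/s peak compute, 1 TB/s peak memory bandwidth.
PEAK_GFLOPS = 10000.0
PEAK_BANDWIDTH_GBS = 1000.0

# SAXPY (y = a*x + y) in FP32: 2 FLOPs per element,
# 12 bytes moved per element (read x, read y, write y).
saxpy_intensity = 2 / 12

# A hypothetical kernel that performs 50 FLOPs per byte moved.
dense_intensity = 50.0

for name, intensity in [("saxpy", saxpy_intensity), ("dense", dense_intensity)]:
    ceiling = attainable_gflops(intensity, PEAK_GFLOPS, PEAK_BANDWIDTH_GBS)
    kind = classify(intensity, PEAK_GFLOPS, PEAK_BANDWIDTH_GBS)
    print(f"{name}: intensity={intensity:.3f} FLOP/byte, "
          f"ceiling={ceiling:.1f} GFLOP/s, {kind}")
```

Comparing a kernel's measured throughput with this ceiling is one way to frame the “good enough” question: a kernel running close to its roof has little headroom left without changing its arithmetic intensity, while one far below it is a candidate for further analysis.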