Pruning and quantization are effective Deep Neural Network (DNN) compression methods for optimized inference on various hardware platforms. Pruning reduces the size of a DNN by removing redundant parameters, while quantization lowers the numerical precision of weights and activations. Advances in accelerator design have enabled efficient training and inference of DNNs. Hardware-Aware Neural Architecture Search (HW-NAS) aims to automatically discover neural architectures that maximize both accuracy and hardware performance metrics.
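To make these two operations concrete, here is a minimal illustrative sketch (not from the talk) of magnitude pruning and symmetric uniform fake quantization in PyTorch; the function names and the 50%/8-bit settings are assumptions chosen for the example:

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero out the smallest-magnitude weights, removing roughly `sparsity` fraction."""
    k = int(sparsity * weight.numel())
    if k == 0:
        return weight
    threshold = weight.abs().flatten().kthvalue(k).values  # k-th smallest magnitude
    return weight * (weight.abs() > threshold)

def uniform_quantize(weight: torch.Tensor, bits: int) -> torch.Tensor:
    """Simulate symmetric uniform quantization to `bits` bits (fake quantization)."""
    qmax = 2 ** (bits - 1) - 1
    scale = weight.abs().max() / qmax          # one scale per tensor, for simplicity
    return torch.round(weight / scale).clamp(-qmax, qmax) * scale

# Illustrative usage: prune half the weights, then quantize the survivors to 8 bits.
w = torch.randn(64, 64)
w_pruned = magnitude_prune(w, sparsity=0.5)
w_int8 = uniform_quantize(w_pruned, bits=8)
```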
We develop Mixed Sparse and Precision Search (MSPS) to search for sparse and mixed-precision quantized models. We extend MSPS into Architecture, Sparse and Precision Search (ASPS), which jointly searches for neural architectural parameters and sparse-precision combinations. We also develop Array Aware Pruning (AAP) to prune a network model with respect to the dimensions of the hardware accelerator's computing array. We demonstrate the effectiveness of MSPS and ASPS on the NVIDIA A100 Tensor Core GPU, and of AAP on the Eyeriss accelerator.
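The abstract does not detail AAP, but a hypothetical sketch of array-aligned structured pruning conveys the general idea: rank groups of output channels whose size matches an assumed processing-array width (the `array_dim` parameter below is illustrative, not the speaker's API) and zero the lowest-scoring groups, so the surviving computation maps evenly onto the array:

```python
import torch

def array_aware_prune(weight: torch.Tensor, array_dim: int, keep_ratio: float) -> torch.Tensor:
    """Illustrative sketch only: prune whole groups of `array_dim` output channels.

    Row groups are ranked by L2 norm, and the lowest-scoring groups are zeroed,
    so the remaining dense work tiles evenly onto an `array_dim`-wide PE array.
    """
    out_ch = weight.shape[0]
    assert out_ch % array_dim == 0, "sketch assumes divisibility; pad in practice"
    n_groups = out_ch // array_dim
    groups = weight.view(n_groups, array_dim, -1)
    scores = groups.flatten(1).norm(dim=1)          # one L2 score per row group
    n_keep = max(1, round(keep_ratio * n_groups))
    keep = scores.topk(n_keep).indices
    mask = torch.zeros(n_groups, dtype=torch.bool)
    mask[keep] = True
    return (groups * mask.view(-1, 1, 1)).view_as(weight)

# Illustrative usage: with a 16-wide array, keep the 4 strongest of 8 row groups.
w = torch.randn(128, 256)
w_pruned = array_aware_prune(w, array_dim=16, keep_ratio=0.5)
```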
Bio: Krishna Teja Chitty-Venkata is currently a PhD Candidate in Computer Engineering at Iowa State University, working under the supervision of Prof. Arun K. Somani. He earned his Bachelor of Engineering from Osmania University, India, majoring in Electronics and Communication. He primarily works at the intersection of systems and deep learning. His interests include machine learning, HPC, AI accelerators, and neural network optimization. His PhD research involves optimizing neural networks for efficient inference on special-purpose accelerators and general-purpose hardware platforms using techniques such as pruning, quantization, and Neural Architecture Search. He has interned at AMD, Intel, and Argonne National Laboratory during his PhD studies.