A Precision-Scalable Energy-Efficient Bit-Split-and-Combination Vector Systolic Accelerator for NAS-Optimized DNNs on Edge

K. Li, J. Zhou, Y. Wang, J. Luo, Z. Yang, S. Yang, W. Mao, M. Huang, H. Yu

Design, Automation and Test in Europe Conference (DATE)

PAPER / SILDES / VIDEO / SHORT


Abstract

Optimized model and energy-efficient hardware are both required for deep neural networks (DNNs) in edge-computing area. Neural architecture search (NAS) methods are employed for DNN model optimization with resulted multi-precision networks. Previous works have proposed low-precision-combination (LPC) and high-precision-split (HPS) methods for multi-precision networks, which are not energy-efficient for precision-scalable vector implementation. In this paper, a BSC-based vector systolic accelerator is developed for a precision-scalable energy-efficient convolution on edge. The maximum energy efficiency of the proposed BSC vector processing element (PE) is up to 1.95× higher in 2-bit, 4-bit and 8-bit operations when compared with LPC and HPS PEs. Further with NAS optimized multi-precision CNN networks, the averaged energy efficiency of the proposed vector systolic BSC PE array achieves up to 2.18× higher in 2bit, 4-bit and 8-bit operations than that of LPC and HPS PE arrays.

Read more

An Energy-Efficient Bit-Split-and-Combination Systolic Accelerator for NAS-Based Multi-Precision Convolution Neural Networks

L. Dai, Q. Cheng, Y. Wang, G. Huang, J. Zhou, K. Li, W. Mao, H. Yu

Design Automation Conference in Asia and South Pacific Region (ASP-DAC)

PAPER / SILDES / VIDEO


Abstract

Optimized convolutional neural network (CNN) models and energy-efficient hardware design are of great importance in edge-computing applications. The neural architecture search (NAS) methods are employed for CNN model optimization with multi-precision networks. To satisfy the computation requirements, multi-precision convolution accelerators are highly desired. The existing high-precision-split (HPS) designs reduce the additional logics for reconfiguration while resulting in low throughput for low precisions. The low-precision-combination (LPC) designs improve the low-precision throughput with large hardware cost. In this work, a bit-split-and-combination (BSC) systolic accelerator is proposed to overcome the bottlenecks. Firstly, BSC-based multiply-accumulate (MAC) unit is designed to support multi-precision computation operations. Secondly, multi-precision systolic dataflow is developed with improved data-reuse and transmission efficiency. The proposed work is designed by Chisel and synthesized in 28-nm process. The BSC MAC unit achieves maximum 2.40× and 1.64× energy efficiency than HPS and LPC units, respectively. Compared with published accelerator designs Gemmini, Bit-fusion and Bit-serial, the proposed accelerator achieves up to 2.94× area efficiency and 6.38× energy-saving performance on the multi-precision VGG16, ResNet-18 and LeNet-5 benchmarks.

Read more