Abstract

Optimized model and energy-efficient hardware are both required for deep neural networks (DNNs) in edge-computing area. Neural architecture search (NAS) methods are employed for DNN model optimization with resulted multi-precision networks. Previous works have proposed low-precision-combination (LPC) and high-precision-split (HPS) methods for multi-precision networks, which are not energy-efficient for precision-scalable vector implementation. In this paper, a BSC-based vector systolic accelerator is developed for a precision-scalable energy-efficient convolution on edge. The maximum energy efficiency of the proposed BSC vector processing element (PE) is up to 1.95× higher in 2-bit, 4-bit and 8-bit operations when compared with LPC and HPS PEs. Further with NAS optimized multi-precision CNN networks, the averaged energy efficiency of the proposed vector systolic BSC PE array achieves up to 2.18× higher in 2bit, 4-bit and 8-bit operations than that of LPC and HPS PE arrays.

An Energy-Efﬁcient Bit-Split-and-Combination Systolic Accelerator for NAS-Based Multi-Precision Convolution Neural Networks

L. Dai, Q. Cheng, Y. Wang, G. Huang, J. Zhou, K. Li, W. Mao, H. Yu

Design Automation Conference in Asia and South Pacific Region (ASP-DAC)

PAPER / SILDES / VIDEO

Abstract

Optimized convolutional neural network (CNN) models and energy-efﬁcient hardware design are of great importance in edge-computing applications. The neural architecture search (NAS) methods are employed for CNN model optimization with multi-precision networks. To satisfy the computation requirements, multi-precision convolution accelerators are highly desired. The existing high-precision-split (HPS) designs reduce the additional logics for reconﬁguration while resulting in low throughput for low precisions. The low-precision-combination (LPC) designs improve the low-precision throughput with large hardware cost. In this work, a bit-split-and-combination (BSC) systolic accelerator is proposed to overcome the bottlenecks. Firstly, BSC-based multiply-accumulate (MAC) unit is designed to support multi-precision computation operations. Secondly, multi-precision systolic dataﬂow is developed with improved data-reuse and transmission efﬁciency. The proposed work is designed by Chisel and synthesized in 28-nm process. The BSC MAC unit achieves maximum 2.40× and 1.64× energy efﬁciency than HPS and LPC units, respectively. Compared with published accelerator designs Gemmini, Bit-fusion and Bit-serial, the proposed accelerator achieves up to 2.94× area efﬁciency and 6.38× energy-saving performance on the multi-precision VGG16, ResNet-18 and LeNet-5 benchmarks.

A Precision-Scalable Energy-Efficient Bit-Split-and-Combination Vector Systolic Accelerator for NAS-Optimized DNNs on Edge

Abstract

An Energy-Efﬁcient Bit-Split-and-Combination Systolic Accelerator for NAS-Based Multi-Precision Convolution Neural Networks

Abstract

Publications

Tags