A Vector Systolic Accelerator for Multi-Precision Floating-Point High-Performance Computing

K. Li, J. Zhou, B. Li, S. Yang, S. Huang, H. Yu

IEEE Transactions on Circuits and Systems II: Express Briefs (TCAS-II)

PAPER


Abstract

There is an emerging need for multi-precision floating-point (FP) accelerators in high-performance-computing (HPC) applications. However, existing multi-precision designs based on the high-precision-split method and the low-precision-combination method suffer from a low hardware-utilization rate and a long multi-clock-cycle processing period, respectively. In this paper, a new pipelined multi-precision FP processing element (PE) is developed with a proposed redundancy-minimized bit-partitioning method. A 3.8× throughput improvement is achieved by the carefully designed pipeline. Compared with existing multi-precision FP-PE methods, this work achieves 11.7%, 6.7%, and 62.6% higher energy efficiency for FP16, FP32, and FP64 operations, respectively. Moreover, to further improve system-level throughput and energy efficiency, a vector systolic accelerator is employed. Benefiting from the pipelined vector FP-PE and vector systolic data reuse, the proposed accelerator exhibits the best energy efficiency of 1193 GFLOPS/W at FP16, 298.3 GFLOPS/W at FP32, and 74.6 GFLOPS/W at FP64.
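The bit-partitioning idea behind such a multi-precision FP-PE can be illustrated arithmetically. The Python sketch below is not the paper's redundancy-minimized partitioning: it assumes 14-bit sub-multipliers and only shows that the significand products of FP16, FP32, and FP64 (11-, 24-, and 53-bit significands including the hidden bit) can all be assembled by shift-and-add from the same pool of small integer multipliers; exponent addition, normalization, and rounding are omitted.

import random

SIG_BITS = {"FP16": 11, "FP32": 24, "FP64": 53}   # significand width, hidden bit included
SEG = 14                                          # assumed sub-multiplier width (illustrative)

def split(sig, width):
    """Split an unsigned width-bit significand into SEG-bit slices, LSB first."""
    n_slices = -(-width // SEG)                   # ceil(width / SEG)
    return [(sig >> (SEG * i)) & ((1 << SEG) - 1) for i in range(n_slices)]

def sig_product(sa, sb, fmt):
    """Significand product built only from SEG-bit x SEG-bit multiplies."""
    width = SIG_BITS[fmt]
    prod = 0
    for i, ai in enumerate(split(sa, width)):
        for j, bj in enumerate(split(sb, width)):
            prod += (ai * bj) << (SEG * (i + j))  # one small multiplier plus a shift
    return prod

# Sanity check: the combined partial products equal the full-width product.
for fmt, width in SIG_BITS.items():
    for _ in range(1000):
        sa = random.getrandbits(width) | (1 << (width - 1))   # hidden leading 1 set
        sb = random.getrandbits(width) | (1 << (width - 1))
        assert sig_product(sa, sb, fmt) == sa * sb

Under this assumed 14-bit split, one FP64 product consumes 16 sub-multiplies, while the same 16 sub-multipliers could instead produce four FP32 or sixteen FP16 significand products, which is the utilization argument for a single multi-precision PE; choosing the partition so that little of this work is redundant is what the paper's redundancy-minimized bit-partitioning addresses, and the sketch does not attempt that.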


A Precision-Scalable Energy-Efficient Bit-Split-and-Combination Vector Systolic Accelerator for NAS-Optimized DNNs on Edge

K. Li, J. Zhou, Y. Wang, J. Luo, Z. Yang, S. Yang, W. Mao, M. Huang, H. Yu

Design, Automation and Test in Europe Conference (DATE)

PAPER / SLIDES / VIDEO / SHORT


Abstract

Optimized models and energy-efficient hardware are both required for deep neural networks (DNNs) in edge computing. Neural architecture search (NAS) methods are employed for DNN model optimization, resulting in multi-precision networks. Previous works have proposed low-precision-combination (LPC) and high-precision-split (HPS) methods for multi-precision networks, which are not energy-efficient for precision-scalable vector implementations. In this paper, a bit-split-and-combination (BSC) vector systolic accelerator is developed for precision-scalable, energy-efficient convolution on edge devices. The energy efficiency of the proposed BSC vector processing element (PE) is up to 1.95× higher for 2-bit, 4-bit, and 8-bit operations compared with LPC and HPS PEs. Further, with NAS-optimized multi-precision CNN models, the average energy efficiency of the proposed vector systolic BSC PE array is up to 2.18× higher for 2-bit, 4-bit, and 8-bit operations than that of LPC and HPS PE arrays.
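The "vector systolic" data reuse that the PE array relies on can be sketched behaviourally. The Python below simulates a generic output-stationary systolic array, a common textbook dataflow chosen here only for illustration and not necessarily this accelerator's exact dataflow: operands are injected at the array edges and ripple through neighbouring PEs, so each word fetched from the buffer is reused by a full row or column of PEs.

import numpy as np

def systolic_matmul(A, B):
    """C = A @ B on an M x N PE grid: rows of A stream rightward, columns of B stream downward."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    acc = np.zeros((M, N), dtype=np.result_type(A, B))   # one output accumulator per PE
    a_reg = np.zeros((M, N), dtype=A.dtype)               # operand registers, shifted rightward
    b_reg = np.zeros((M, N), dtype=B.dtype)               # operand registers, shifted downward
    for t in range(M + N + K - 2):                        # length of the skewed schedule
        # 1) pass operands on to the neighbouring PE (right / down)
        a_reg[:, 1:] = a_reg[:, :-1].copy()
        b_reg[1:, :] = b_reg[:-1, :].copy()
        # 2) inject a skewed wavefront at the left and top edges only
        for i in range(M):
            k = t - i
            a_reg[i, 0] = A[i, k] if 0 <= k < K else 0
        for j in range(N):
            k = t - j
            b_reg[0, j] = B[k, j] if 0 <= k < K else 0
        # 3) every PE multiplies the operand pair currently passing through it
        acc += a_reg * b_reg
    return acc

A = np.random.randint(-8, 8, size=(4, 6))
B = np.random.randint(-8, 8, size=(6, 5))
assert np.array_equal(systolic_matmul(A, B), A @ B)

Each element of A is fetched once but used by N PEs (and each element of B by M PEs), which is the data-reuse benefit of systolic dataflow; the vector organisation in the paper additionally has each PE operate on short operand vectors per cycle, which this scalar sketch does not model.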

Read more

An Energy-Efficient Bit-Split-and-Combination Systolic Accelerator for NAS-Based Multi-Precision Convolution Neural Networks

L. Dai, Q. Cheng, Y. Wang, G. Huang, J. Zhou, K. Li, W. Mao, H. Yu

Asia and South Pacific Design Automation Conference (ASP-DAC)

PAPER / SLIDES / VIDEO


Abstract

Optimized convolutional neural network (CNN) models and energy-efficient hardware design are of great importance in edge-computing applications. Neural architecture search (NAS) methods are employed for CNN model optimization, producing multi-precision networks. To satisfy the resulting computation requirements, multi-precision convolution accelerators are highly desired. Existing high-precision-split (HPS) designs reduce the additional reconfiguration logic but suffer low throughput at low precisions, while low-precision-combination (LPC) designs improve low-precision throughput at a large hardware cost. In this work, a bit-split-and-combination (BSC) systolic accelerator is proposed to overcome both bottlenecks. First, a BSC-based multiply-accumulate (MAC) unit is designed to support multi-precision computation. Second, a multi-precision systolic dataflow is developed with improved data reuse and transmission efficiency. The proposed design is implemented in Chisel and synthesized in a 28-nm process. The BSC MAC unit achieves up to 2.40× and 1.64× higher energy efficiency than HPS and LPC units, respectively. Compared with the published accelerator designs Gemmini, Bit-fusion, and Bit-serial, the proposed accelerator achieves up to 2.94× higher area efficiency and 6.38× energy savings on the multi-precision VGG16, ResNet-18, and LeNet-5 benchmarks.
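A purely arithmetic sketch of the bit-split-and-combination idea is given below, under the assumptions of unsigned operands and 2-bit sub-multipliers (the paper's MAC circuit, signed handling, and accumulator sizing are not modelled): one pool of small multipliers serves 2-bit operands directly or is combined by shift-and-add into 4-bit and 8-bit multiplies, so the same hardware is reused across all three precisions.

import random

def split(x, width, seg=2):
    """Split an unsigned width-bit operand into width//seg slices of seg bits, LSB first."""
    return [(x >> (seg * i)) & ((1 << seg) - 1) for i in range(width // seg)]

def bsc_mac(acc, a, b, width=8, seg=2):
    """Return acc + a*b, with a*b assembled only from seg-bit x seg-bit products."""
    for i, ai in enumerate(split(a, width, seg)):
        for j, bj in enumerate(split(b, width, seg)):
            acc += (ai * bj) << (seg * (i + j))   # one small multiplier plus a shift
    return acc

def dot(xs, ws, width):
    """Dot product of equally quantized vectors using the multi-precision MAC."""
    acc = 0
    for x, w in zip(xs, ws):
        acc = bsc_mac(acc, x, w, width=width)
    return acc

# Sanity check: only `width` changes between 2-, 4-, and 8-bit modes.
for width in (2, 4, 8):
    xs = [random.randrange(1 << width) for _ in range(16)]
    ws = [random.randrange(1 << width) for _ in range(16)]
    assert dot(xs, ws, width) == sum(x * w for x, w in zip(xs, ws))

In this assumed 2-bit split, an 8-bit multiply consumes 16 sub-products, a 4-bit multiply 4, and a 2-bit multiply 1, so the same 16 sub-multipliers can deliver 1, 4, or 16 products per cycle as the precision drops, which is the precision-scalable throughput that the BSC MAC targets.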

Read more