This document presents a method for high-throughput convolutional neural network (CNN) inference on an FPGA using customized JPEG compression. It decomposes convolutions into channel shift and pointwise operations, employs binary weight quantization, and uses a fully pipelined architecture. Experimental results show the proposed JPEG compression achieves an 82x speedup with only a 0.3% accuracy drop. Implemented on an FPGA, the CNN achieves 3,321 frames per second at 75 watts, over 100x and 10x faster than CPU and GPU baselines, respectively.
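A minimal sketch of the shift-plus-pointwise decomposition in C may make the idea concrete: each input channel is shifted by a fixed spatial offset instead of being convolved with a spatial kernel, and a 1x1 convolution with binary (+1/-1) weights then mixes the shifted channels. This is illustrative only, not the paper's FPGA pipeline; the sizes and names are assumptions for the example.

```c
#include <string.h>

#define H 8
#define W 8
#define C_IN 4
#define C_OUT 4

/* Step 1: shift each input channel by a fixed (dy, dx) offset instead of
 * convolving with a spatial kernel. Out-of-range pixels are zero-padded. */
static void channel_shift(const float in[C_IN][H][W], float out[C_IN][H][W],
                          const int dy[C_IN], const int dx[C_IN]) {
    memset(out, 0, sizeof(float) * C_IN * H * W);
    for (int c = 0; c < C_IN; c++)
        for (int y = 0; y < H; y++)
            for (int x = 0; x < W; x++) {
                int sy = y - dy[c], sx = x - dx[c];
                if (sy >= 0 && sy < H && sx >= 0 && sx < W)
                    out[c][y][x] = in[c][sy][sx];
            }
}

/* Step 2: 1x1 (pointwise) convolution with binary +1/-1 weights, mixing
 * the shifted channels at each pixel; no multiplier is needed, only
 * add/subtract, which is what makes it cheap in hardware. */
static void pointwise_binary(const float in[C_IN][H][W], float out[C_OUT][H][W],
                             const signed char w[C_OUT][C_IN]) {
    for (int k = 0; k < C_OUT; k++)
        for (int y = 0; y < H; y++)
            for (int x = 0; x < W; x++) {
                float acc = 0.0f;
                for (int c = 0; c < C_IN; c++)
                    acc += (w[k][c] > 0 ? in[c][y][x] : -in[c][y][x]);
                out[k][y][x] = acc;
            }
}
```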
Challenges faced during embedded system design: the design of embedded systems has been constrained for decades by the same limiting requirements of small form factor, low energy consumption, and long-term stable performance without maintenance.
ppt: Basic of AI Accelerator Design using Verilog HDL
git: https://github.com/matbi86/01_ai_accelerator_basic_for_student
ref: http://eyeriss.mit.edu/tutorial.html
The document provides an overview of the ARM architecture and Cortex-M3 processor. It discusses ARM Ltd.'s history and business model as an IP licensing company. It then describes the Cortex-M3 microcontroller, including its programmer's model, exception and interrupt handling, pipeline, and instruction sets. Key points are the Cortex-M3's stack-based exception model, 3-stage pipeline, conditional execution support, and AHB/APB system design integration.
This document describes the design and simulation of different 8-bit multipliers using Verilog code, comparing them on area, speed, delay, complexity, and power consumption. It covers four multipliers: array, Wallace tree, Baugh-Wooley, and Vedic, and finds that the Baugh-Wooley multiplier has advantages in speed, delay, area, complexity, and power consumption over the others. The document also reviews the building blocks involved: half adders, full adders, ripple carry adders, carry save adders, and multiplication algorithms.
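As a reference point for what these circuits compute, here is a behavioral C model of the shift-and-add algorithm an array multiplier implements in hardware: each bit of the multiplier selects a shifted copy of the multiplicand as a partial product, and the partial products are summed. This is a software sketch, not the document's Verilog.

```c
#include <stdint.h>
#include <stdio.h>

/* Behavioral model of an 8x8 unsigned array multiplier: bit i of b
 * selects partial product (a << i); the sum of all selected rows is
 * exactly what the adder array computes in hardware. */
static uint16_t mul8_shift_add(uint8_t a, uint8_t b) {
    uint16_t product = 0;
    for (int i = 0; i < 8; i++)
        if (b & (1u << i))                /* bit i of b set? */
            product += (uint16_t)a << i;  /* add shifted copy of a */
    return product;
}

int main(void) {
    printf("%u\n", mul8_shift_add(23, 45)); /* prints 1035 */
    return 0;
}
```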
5G-NR (New Radio) is the 5G wireless standard developed by 3GPP to support both sub-6 GHz and mmWave spectrum. It supports three main use cases: enhanced mobile broadband (eMBB), massive machine-type communications (mMTC), and ultra-reliable low-latency communications (URLLC). 5G-NR can operate in both non-standalone and standalone modes, with non-standalone relying on the existing 4G LTE network for core functionality and standalone operating independently. Key 5G capabilities include peak data rates of up to 20 Gbps, latency of around 1 ms, support for mobility of up to 500 km/h, and the ability to connect a massive number of devices.
The SPI (Serial Peripheral Interface) is a synchronous serial communication protocol used for communication between devices. It uses a master-slave architecture with a single master device initiating data transfer. Key features include using separate clock and data lines, operating in full duplex mode, and allowing multiple slave devices through individual chip selects. It provides a lower pin count solution than parallel buses at the cost of slower communication speeds.
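A bit-banged SPI master in C illustrates the protocol mechanics the summary describes: separate clock and data lines, full-duplex shifting, and a per-slave chip select. The GPIO helpers and pin names are placeholders, and the timing shown is SPI mode 0 (CPOL=0, CPHA=0).

```c
#include <stdint.h>

/* set_pin()/read_pin() stand in for platform-specific GPIO calls. */
extern void set_pin(int pin, int level);
extern int  read_pin(int pin);

enum { PIN_SCK, PIN_MOSI, PIN_MISO, PIN_CS };

/* Full-duplex transfer of one byte in SPI mode 0: data is driven on MOSI
 * before the rising SCK edge and sampled on MISO at that edge, MSB first. */
static uint8_t spi_transfer(uint8_t out) {
    uint8_t in = 0;
    for (int i = 7; i >= 0; i--) {
        set_pin(PIN_MOSI, (out >> i) & 1);  /* drive next outgoing bit */
        set_pin(PIN_SCK, 1);                /* rising edge: slave samples */
        in = (uint8_t)((in << 1) | read_pin(PIN_MISO)); /* sample slave bit */
        set_pin(PIN_SCK, 0);                /* falling edge: shift */
    }
    return in;
}

/* A transaction frames one or more transfers with the chip select. */
static uint8_t spi_read_register(uint8_t addr) {
    set_pin(PIN_CS, 0);                 /* select the slave (active low) */
    spi_transfer(addr);
    uint8_t value = spi_transfer(0x00); /* clock a dummy byte to read back */
    set_pin(PIN_CS, 1);
    return value;
}
```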
Placement is the process of determining the locations of circuit devices on a chip. It is a critical step that affects performance, routability, heat distribution, and power consumption. There are different types of placement like standard cell placement and building block placement. Placement algorithms aim to optimize objectives like minimizing total area and wire length. Simulated annealing is a commonly used iterative placement algorithm that models the physical annealing process to arrive at a low-cost solution. Other algorithms include partitioning-based approaches and cluster growth.
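A toy annealing placer in C shows the core loop: propose a random move (here, swapping two cells), and accept cost-increasing moves with probability exp(-delta/T) so the search can escape local minima while the temperature cools. The 1-D row, cost function, and schedule constants are simplifying assumptions; real placers work on 2-D grids with richer objectives.

```c
#include <math.h>
#include <stdlib.h>

#define N_CELLS 32
#define N_NETS  64

static int slot_of[N_CELLS];              /* cell -> position in the row */
static int net_a[N_NETS], net_b[N_NETS];  /* each net connects two cells */

/* Cost: total wirelength of all 2-pin nets. */
static int wirelength(void) {
    int cost = 0;
    for (int n = 0; n < N_NETS; n++)
        cost += abs(slot_of[net_a[n]] - slot_of[net_b[n]]);
    return cost;
}

static void anneal(void) {
    double T = 100.0;                     /* initial temperature */
    int cost = wirelength();
    while (T > 0.01) {
        for (int moves = 0; moves < 200; moves++) {
            int i = rand() % N_CELLS, j = rand() % N_CELLS;
            int tmp = slot_of[i]; slot_of[i] = slot_of[j]; slot_of[j] = tmp;
            int delta = wirelength() - cost;
            /* Accept improvements always; accept uphill moves with
             * probability exp(-delta/T), which shrinks as T cools. */
            if (delta <= 0 || (double)rand() / RAND_MAX < exp(-delta / T)) {
                cost += delta;
            } else {                      /* reject: undo the swap */
                tmp = slot_of[i]; slot_of[i] = slot_of[j]; slot_of[j] = tmp;
            }
        }
        T *= 0.95;                        /* geometric cooling schedule */
    }
}
```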
This document discusses various VLSI testing techniques. It begins by explaining the need for testing circuits when they are first developed and manufactured to check that they meet specifications. The main testing approach is to apply test inputs and compare the outputs to expected patterns. It then describes different testing techniques for combinational and sequential circuits, including fault modeling, path sensitizing, scan path testing, built-in self-test (BIST), boundary scan testing, and signature analysis. Specific circuit examples are provided to illustrate scan path testing, BIST using linear feedback shift registers (LFSRs) and compressor circuits, and boundary scan testing.
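The LFSR-based BIST scheme can be sketched in a few lines of C: one LFSR generates pseudo-random test patterns, and a second register compacts the circuit's responses into a signature that is compared against a known-good value. The tap mask and the stand-in circuit_under_test() are assumptions for illustration.

```c
#include <stdint.h>

/* Galois LFSR step; tap mask 0xB8 (x^8 + x^6 + x^5 + x^4 + 1) gives a
 * maximal-length sequence of 255 states. */
static uint8_t lfsr_step(uint8_t s) {
    return (uint8_t)((s >> 1) ^ (-(s & 1) & 0xB8));
}

/* Placeholder for the combinational logic being tested. */
extern uint8_t circuit_under_test(uint8_t input);

/* BIST run: apply n pseudo-random patterns and fold each response into
 * the signature register (MISR-style compaction). */
static uint8_t run_bist(int n_patterns) {
    uint8_t gen = 0x01;   /* pattern generator state (must be nonzero) */
    uint8_t sig = 0x00;   /* signature register */
    for (int i = 0; i < n_patterns; i++) {
        uint8_t response = circuit_under_test(gen);
        sig = (uint8_t)(lfsr_step(sig) ^ response);
        gen = lfsr_step(gen);
    }
    return sig;  /* compare against the precomputed fault-free signature */
}
```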
This document summarizes the physical layer design of LTE Release 8 and enhancements for LTE-Advanced. It describes the downlink and uplink multiple access schemes, reference signals, control signaling, data transmission procedures, UE categories, and support for frequency division duplex and time division duplex operation. The document provides an overview of the 3GPP release timeline and the specifications that define the LTE physical layer.
This presentation describes the ARM Cortex-M3 core processor, with details of the core peripherals. A ppt on a Cortex-based controller (STM32F100RBT6) will be uploaded soon. For more information, mail me at gaurav.iitkg@gmail.com.
This document provides an overview of C programming for embedded systems. It discusses how embedded programming differs from general programming, focusing on resource constraints, hardware differences, and lack of debugging tools in embedded systems. It also covers how C is commonly used for embedded programming, emphasizing static memory allocation, inline assembly, and avoiding complex features. Finally, it introduces the GCC toolchain for compiling C code for embedded devices.
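A short C fragment illustrates two of the idioms mentioned: statically allocated buffers instead of malloc(), and volatile pointers for memory-mapped peripheral registers. The UART addresses and bit positions below are hypothetical; real values come from the target device's memory map.

```c
#include <stdint.h>

/* Hypothetical memory-mapped UART registers, for illustration only. */
#define UART_STATUS (*(volatile uint32_t *)0x40001000u)
#define UART_DATA   (*(volatile uint32_t *)0x40001004u)
#define TX_READY    (1u << 5)

/* Statically allocated buffer: no malloc(), so memory use is fixed at
 * link time and allocation cannot fail at runtime. */
static uint8_t tx_buffer[64];

static void uart_send(const uint8_t *data, uint32_t len) {
    for (uint32_t i = 0; i < len; i++) {
        /* volatile forces a fresh read of the status register on every
         * iteration; the compiler may not cache it. */
        while ((UART_STATUS & TX_READY) == 0)
            ;                          /* busy-wait until transmitter free */
        UART_DATA = data[i];
    }
}

int main(void) {
    tx_buffer[0] = 'O';
    tx_buffer[1] = 'K';
    uart_send(tx_buffer, 2);
    for (;;)
        ;                              /* embedded main() never returns */
}
```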
The 8086 microprocessor is a 16-bit CPU launched by Intel in 1978. It has a 16-bit data bus and 20-bit address bus, allowing it to access up to 1MB of memory. The 8086 architecture partitions the CPU logic into two functional units - the Bus Interface Unit which handles external transactions, and the Execution Unit which performs decoding and execution. This separation improves processing speed by allowing parallel instruction fetching and execution via pipelining. The 8086 uses memory segmentation to access more memory than its 16-bit registers allow, dividing the 1MB address space into 64KB segments addressed using segment and offset registers.
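The segmentation scheme reduces to one line of arithmetic: the 16-bit segment register is shifted left 4 bits (multiplied by 16) and added to the 16-bit offset, giving a 20-bit physical address. A small C sketch:

```c
#include <stdint.h>
#include <stdio.h>

/* 8086 address translation: physical = segment * 16 + offset. */
static uint32_t phys_addr(uint16_t segment, uint16_t offset) {
    return ((uint32_t)segment << 4) + offset;
}

int main(void) {
    /* Example: CS=0x2000, IP=0x0100 -> physical address 0x20100. */
    printf("0x%05X\n", phys_addr(0x2000, 0x0100));
    return 0;
}
```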
The document discusses the instruction set of the 8086 microprocessor. It describes the 8086 as supporting over 20,000 instruction variants (counting combinations of operands and addressing modes), classified into several categories such as data transfer, arithmetic, bit manipulation, program execution transfer, and string instructions. Under each category, it details specific instructions like MOV, ADD, AND, and CALL, and explains their functionality and operand usage.
Here we can find the pin diagram of the ADC 0808 and how to interface it with the 8086 microprocessor using the 8255 PPI.
This document discusses concurrent programming with real-time operating systems (RTOS). It begins with an overview of RTOS and what they provide to programmers, such as task management, synchronization primitives, and driver packages. It then discusses specific RTOS concepts like tasks, concurrency primitives like semaphores, and common concurrency problems like data races. Examples are given to demonstrate task creation and using semaphores to safely increment a shared variable between tasks. The document concludes with discussing classical concurrency problems like the dining philosophers problem and potential issues that could arise like deadlock or starvation.
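The shared-counter example translates directly to POSIX threads and semaphores, used here as a portable stand-in for an RTOS task/semaphore API such as FreeRTOS's. Without the semaphore, the two tasks race on the read-modify-write of the counter and the final value is unpredictable.

```c
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

static long counter = 0;
static sem_t lock;

static void *task(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        sem_wait(&lock);   /* take the semaphore: enter critical section */
        counter++;         /* without the semaphore this is a data race */
        sem_post(&lock);   /* give the semaphore back */
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    sem_init(&lock, 0, 1);            /* binary semaphore, initially free */
    pthread_create(&t1, NULL, task, NULL);
    pthread_create(&t2, NULL, task, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter); /* always 200000 with the lock */
    return 0;
}
```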
Turbo codes are a class of error-correcting codes that achieve performance close to the Shannon limit, the theoretical maximum rate of reliable communication over a noisy channel. A turbo encoder combines two recursive systematic convolutional (RSC) encoders separated by an interleaver, and the decoder iterates soft information between the two corresponding component decoders; this iterative decoding lets turbo codes correct errors very efficiently. They are used in applications such as deep-space communications and mobile phone networks because they operate reliably at low signal-to-noise ratios.
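A minimal encoder sketch in C, using the classic (7,5)-octal RSC component: the systematic bit passes through unchanged, each RSC produces a parity bit from its feedback shift register, and the second encoder sees the data bits permuted by the interleaver. The iterative decoder is omitted here.

```c
typedef struct { int s1, s2; } rsc_t;   /* two-bit shift-register state */

/* One step of a rate-1/2 recursive systematic convolutional encoder. */
static int rsc_step(rsc_t *e, int u) {
    int fb = u ^ e->s1 ^ e->s2;  /* feedback, g0 = 1 + D + D^2 (7 octal) */
    int parity = fb ^ e->s2;     /* parity,   g1 = 1 + D^2     (5 octal) */
    e->s2 = e->s1;
    e->s1 = fb;
    return parity;
}

/* Turbo encoding: encoder 1 sees the data in order; encoder 2 sees the
 * same bits permuted by the interleaver pi[]. The transmitted stream is
 * the systematic bits plus both parity streams. */
static void turbo_encode(const int *data, const int *pi, int n,
                         int *sys, int *par1, int *par2) {
    rsc_t e1 = {0, 0}, e2 = {0, 0};
    for (int i = 0; i < n; i++) {
        sys[i]  = data[i];
        par1[i] = rsc_step(&e1, data[i]);
        par2[i] = rsc_step(&e2, data[pi[i]]);
    }
}
```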
1. Different tools use different descriptions for power management, making it difficult to verify configurations and keep definitions consistent across the design flow.
2. There is no automation for verifying power management definitions, so designers must manually check thousands of statements.
3. The design hierarchy and syntax vary between tools and between RTL and gate-level representations, complicating cross-checking.
4. It is challenging to verify power functionality without changing RTL code, since power and ground nets are not explicitly captured or simulated.
This document discusses deep learning initiatives at NECSTLab focused on hardware acceleration of convolutional neural networks using FPGAs. It proposes a framework called CNNECST that provides high-level APIs to design CNNs, integrates with machine learning frameworks for training, and generates customized hardware for FPGA implementation through C++ libraries and Vivado. Experimental results show speedups and energy savings on FPGA boards compared to CPU for CNNs such as LeNet on datasets like MNIST. Challenges and future work include supporting more layer types and reduced-precision computation.
Optimizing a convolutional neural network by improving the algorithm and parallelizing it with OpenMP and OpenMPI.
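A sketch of the shared-memory half of that idea: the output channels and rows of a convolution layer are independent, so the outer loops can be divided across OpenMP threads; OpenMPI would then distribute batches of images across nodes in the same spirit. Sizes and data layout below are assumptions for the example.

```c
#include <omp.h>

#define H 32
#define W 32
#define C 16
#define K 3

/* Valid (no-padding) convolution; each (oc, y) pair is independent, so
 * the collapsed outer loops are split statically across threads. */
void conv_layer(const float in[C][H][W], float out[C][H][W],
                const float w[C][C][K][K]) {
    #pragma omp parallel for collapse(2) schedule(static)
    for (int oc = 0; oc < C; oc++)
        for (int y = 0; y <= H - K; y++)
            for (int x = 0; x <= W - K; x++) {
                float acc = 0.0f;
                for (int ic = 0; ic < C; ic++)
                    for (int ky = 0; ky < K; ky++)
                        for (int kx = 0; kx < K; kx++)
                            acc += w[oc][ic][ky][kx] * in[ic][y + ky][x + kx];
                out[oc][y][x] = acc;
            }
}
```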
Speech recognition is one of the key topics in artificial intelligence, as speech is one of the most common forms of human communication. Researchers have developed many speech-controlled prosthetic hands in the past decades, utilizing conventional speech recognition systems that combine a neural network with a hidden Markov model. Recent advancements in general-purpose graphics processing units (GPGPUs) enable intelligent devices to run deep neural networks in real time. Thus, state-of-the-art speech recognition systems have rapidly shifted from the paradigm of composite subsystem optimization to the paradigm of end-to-end optimization. However, a low-power embedded GPGPU cannot run these speech recognition systems in real time. In this paper, we show the development of deep convolutional neural networks (CNNs) for speech control of prosthetic hands that run in real time on an NVIDIA Jetson TX2 developer kit. First, the device captures speech and converts it into 2D features (such as a spectrogram). The CNN receives the 2D features and classifies the hand gestures. Finally, the hand gesture classes are sent to the prosthetic hand motion control system. The whole system is written in Python with Keras, a deep learning library with a TensorFlow backend. Our experiments on the CNN demonstrate 91% accuracy and a 2 ms running time for producing hand gesture classes (text output) from speech commands, which can be used to control the prosthetic hands in real time. 2019 First International Conference on Transdisciplinary AI (TransAI), Laguna Hills, California, USA, 2019, pp. 35-42.
This document summarizes an adaptive modular approach for mining sensor network data using machine learning techniques. It presents a two-layer architecture that uses an online compression algorithm (PCA) in the first layer to reduce data dimensionality and an adaptive lazy learning algorithm (KNN) in the second layer for prediction and regression tasks. Simulation results on a wave propagation dataset show the approach can handle non-stationarities like concept drift, sensor failures and network changes in an efficient and adaptive manner.
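A compact C sketch of the second-layer idea: k-nearest-neighbour regression over the PCA-compressed features from the first layer. The learner is "lazy" in that no model is fit up front; predictions average the targets of the nearest stored samples, so adding or expiring samples adapts the layer to drifting data. Dimensions, K, and the data layout are assumptions for the example.

```c
#include <float.h>

#define D 4    /* feature dimensionality after PCA compression */
#define K 3    /* number of neighbours averaged for a prediction */

static double sq_dist(const double *a, const double *b) {
    double s = 0.0;
    for (int i = 0; i < D; i++)
        s += (a[i] - b[i]) * (a[i] - b[i]);
    return s;
}

/* Predict by averaging the targets of the K training points closest to
 * the query (assumes n >= K). */
double knn_predict(const double train_x[][D], const double *train_y,
                   int n, const double *query) {
    int used[K];
    double sum = 0.0;
    for (int k = 0; k < K; k++) {
        int best = -1;
        double best_d = DBL_MAX;
        for (int i = 0; i < n; i++) {
            int skip = 0;
            for (int j = 0; j < k; j++)      /* skip already-chosen points */
                if (used[j] == i) skip = 1;
            double d = sq_dist(train_x[i], query);
            if (!skip && d < best_d) {
                best_d = d;
                best = i;
            }
        }
        used[k] = best;
        sum += train_y[best];
    }
    return sum / K;
}
```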