High-Throughput Convolutional Neural Network
on an FPGA by Customized JPEG Compression
Hiroki Nakahara
Tokyo Institute of Technology, JP
Zhiqiang Que, Wayne Luk
Imperial College London, UK
Outline
• Background
• JPEG compression for a high-speed inference
• CNN model for an FPGA implementation
• Channel shift and point-wise decomposition
• Quantization strategy
• Channel shuffle
• Fully-pipelined CNN architecture
• Experimental results
• Conclusion
Convolutional Neural Networks (CNNs)
• High accuracy and many applications
• Image recognition, NLP, data mining [1]
• FPGAs on cloud services
• Amazon AWS, Microsoft Azure, etc.
[1] Y. Liang, K. Ouyang, L. Jing, S. Ruan, Y. Liu, J. Zhang, D. S. Rosenblum and Y. Zheng,
“UrbanFM: Inferring Fine-Grained Urban Flows,” ACM SIGKDD Conf. on knowledge discovery
and data mining (KDD), 2019, pp.3132–3142.
Problems
• Power consumption
• Performance bottleneck (Data-transfer)
• e.g., AWS F1 provides overall read/write at 6.5 GB/s from host CPU to FPGA [1]
[Figure: Host PC (.jpg) → RAW (RGB) image → PCIe → interconnect → CNN kernel on the accelerator card]
[1] Y. Chen, J. He, X. Zhang, C. Hao, and D. Chen, “Cloud-DNN: An Open Framework for
Mapping DNN Models to Cloud FPGAs,” FPGA, 2019, pp.73–82.

Our Contributions
• Customized JPEG for a high-speed data transfer
→ Compression ratio (Speed-up) vs. accuracy
• Fully pipelined inference architecture w/ light-weight CNN
[Figure: (a) Conventional: the host PC (.jpg) sends RAW (RGB) images over PCIe to the FPGA (interconnect, CNN kernel). (b) Proposed: the host PC sends low-quality .jpg images over PCIe to the FPGA (interconnect, decoder, CNN kernel).]
Dog? Cat?
Image source: https://www.kaggle.com/c/dogs-vs-cats/data
Labradoodle? Fried Chicken?
Source: https://bit.ly/2zveHGT
Customized JPEG for a High-speed Inference
JPEG Coding
[Figure: JPEG coding pipeline.
Encoding: RGB image → pre-processing → picture matrix → DCT → DCT matrix → quantization (quant. table) → Huffman encoding (Huffman coding table) → compressed image data with JPEG header.
Decoding: compressed image data → Huffman decoding → reverse quantization → IDCT → post-processing.]
Proposed JPEG Coding with a CNN Accelerator
[Figure: Host PC: JPEG image (.jpg) → quantization with an extreme quant. value q (quant. table) → Huffman encoding (Huffman coding table) → image stream data → PCIe.
FPGA: RAM ping-pong buffers → Huffman decoding & reverse quantization (Huffman coding table) → IDCT → fully pipelined CNN → detection result.]

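Huffman Decoding and Reverse Quantization Unit
[Figure: The image data stream enters a shift register. A priority encoder matches the code prefixes 00**, 01**, 10**, 110*, 1110 (indices 0 to 4, shift values 2, 2, 2, 3, 4) and outputs the shift value and the quantized value, which is combined with the quant. value q. A run-length decoder and a zig-zag pattern ROM produce the address (ADR) and write data (WDATA) for zig-zag writing into the buffer RAM.]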
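The run-length decoding, zig-zag writing, and reverse quantization named in this unit can be sketched in software as follows. This is only an illustrative model, not the authors' hardware: the `(run, value)` pair format, the function names, and the use of a single scalar step `q_step` for reverse quantization are assumptions made for the example.

```python
def zigzag_order(n=8):
    """(row, col) visiting order of the JPEG zig-zag scan for an n x n block."""
    order = []
    for s in range(2 * n - 1):                  # s = row + col selects one anti-diagonal
        diag = [(r, s - r) for r in range(n) if 0 <= s - r < n]
        order.extend(diag if s % 2 else reversed(diag))
    return order


def decode_block(pairs, q_step, n=8):
    """Expand (zero_run, value) pairs into an n x n coefficient block.

    Assumption: each pair means "skip zero_run zero coefficients, then write
    value", and reverse quantization is a multiplication by q_step.
    """
    order = zigzag_order(n)                     # plays the role of the zig-zag pattern ROM
    block = [[0] * n for _ in range(n)]         # plays the role of the buffer RAM
    pos = 0
    for run, value in pairs:
        pos += run                              # run-length decoding: skip the zeros
        row, col = order[pos]                   # write address taken from the zig-zag table
        block[row][col] = value * q_step        # reverse quantization
        pos += 1
    return block


block = decode_block([(0, 5), (1, -2)], q_step=4)
```

Here the first pair lands on the DC position (0, 0) and the second pair skips one coefficient before writing to position (1, 0).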
2D-IDCT
• Decompose the 2D-IDCT into 16 1D-IDCTs
AP-922: Application Note 922, “A Fast Precise Implementation of 8x8 Discrete Cosine Transform Using the Streaming SIMD Extensions and MMX Instructions,” https://www.cs.cmu.edu/~barbic/cs-740/ap922.pdf
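The row-column decomposition can be sketched in NumPy: an 8x8 2D-IDCT is computed as 8 one-dimensional IDCTs over the rows followed by 8 over the columns. This is a floating-point illustration assuming an orthonormal DCT-II, not the 16-bit hardware unit or the AP-922 algorithm, and the function names are chosen for the example.

```python
import numpy as np

N = 8  # JPEG block size


def dct_matrix(n=N):
    """Orthonormal DCT-II matrix C, so that C @ x is the 1D DCT of x."""
    k = np.arange(n).reshape(-1, 1)             # frequency index (rows)
    i = np.arange(n).reshape(1, -1)             # sample index (columns)
    mat = np.sqrt(2.0 / n) * np.cos((2 * i + 1) * k * np.pi / (2 * n))
    mat[0, :] = np.sqrt(1.0 / n)
    return mat


C = dct_matrix()


def idct_1d(coeffs):
    """1D inverse DCT of one 8-element vector (C is orthonormal, so C.T inverts it)."""
    return C.T @ coeffs


def idct_2d(block):
    """2D-IDCT of an 8x8 block as 16 one-dimensional IDCTs."""
    tmp = np.empty((N, N))
    for row in range(N):                        # 8 transforms, one per row
        tmp[row, :] = idct_1d(block[row, :])
    out = np.empty((N, N))
    for col in range(N):                        # 8 transforms, one per column
        out[:, col] = idct_1d(tmp[:, col])
    return out


rng = np.random.default_rng(0)
pixels = rng.integers(0, 256, size=(N, N)).astype(float)
coeffs = C @ pixels @ C.T                       # forward 2D DCT, only to build a test input
restored = idct_2d(coeffs)
```

Because the two passes are independent along each axis, the same 1D unit can be reused for both.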
2D-IDCT Unit
• Two 1D-IDCT units
• Use half precision (16 bits)
[Figure: a controller drives the 1D-IDCT unit (operation units and registers), which is connected to three RAMs]
CNN model for an FPGA Implementation

Overview
1. Decomposing k×k convolution by channel shift [1] and point-wise (1×1) convolution
2. Binary (1-bit) weight quantization [2]
3. Channel split and shuffle [3]
[1] B. Wu, A. Wan, X. Yue, P. H. Jin, S. Zhao, N. Golmant, A. Gholaminejad, J. Gonzalez, and K. Keutzer, “Shift: A Zero FLOP, Zero Parameter Alternative to Spatial Convolutions,” CVPR, 2018, pp. 9127–9135.
[2] M. Courbariaux, Y. Bengio, and J.-P. David, “BinaryConnect: Training Deep Neural Networks with Binary Weights During Propagations,” NIPS, 2015, pp. 3105–3113.
[3] X. Zhang, X. Zhou, M. Lin and J. Sun, “ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices,” CVPR, 2018.
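A minimal NumPy sketch of the first item: each channel is shifted spatially by a fixed offset (no multiplications), and a 1×1 convolution then mixes the channels. The cyclic assignment of offsets to channels in `shift_offsets` is a hypothetical choice for illustration; the slides do not state how offsets are assigned.

```python
import numpy as np


def channel_shift(x, offsets):
    """Shift channel i of x (shape C x H x W) by offsets[i] = (dy, dx), zero-filling the border."""
    _, h, w = x.shape
    out = np.zeros_like(x)
    for ch, (dy, dx) in enumerate(offsets):
        src = x[ch, max(0, -dy):min(h, h - dy), max(0, -dx):min(w, w - dx)]
        out[ch, max(0, dy):min(h, h + dy), max(0, dx):min(w, w + dx)] = src
    return out


def pointwise_conv(x, weights):
    """1x1 convolution: weights has shape (C_out, C_in) and mixes channels at every pixel."""
    return np.einsum("oc,chw->ohw", weights, x)


def shift_offsets(channels, k=3):
    """Hypothetical assignment: cycle through the k*k offsets of a k x k window."""
    r = k // 2
    grid = [(dy, dx) for dy in range(-r, r + 1) for dx in range(-r, r + 1)]
    return [grid[c % len(grid)] for c in range(channels)]


x = np.arange(2 * 3 * 3, dtype=float).reshape(2, 3, 3)
shifted = channel_shift(x, [(0, 1), (1, 0)])    # channel 0 moves right, channel 1 moves down
y = pointwise_conv(shifted, np.ones((4, 2)))    # 2 input channels mixed into 4 outputs
```

The spatial context that a k×k kernel would gather is supplied by the shifts, so only the 1×1 convolution carries weights.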
Building Blocks
[Figure: (a) Plain block: channel split (#channel/2) → shift → PWConv → shift → PWConv → concat & shuffle.
(b) Down-sampling block (#channel x2): channel split; one branch shift → PWConv (s=2) → shift → PWConv, the other branch shift → PWConv (s=2); then concat & shuffle.]
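The split, concat, and shuffle steps of the plain block can be sketched as below. Passing one half through unchanged while the other half goes through the shift and PWConv layers is an assumption borrowed from ShuffleNet-style blocks, and `branch` is a placeholder for those layers.

```python
import numpy as np


def channel_split(x):
    """Split a (C, H, W) tensor into two halves along the channel axis."""
    half = x.shape[0] // 2
    return x[:half], x[half:]


def channel_shuffle(x, groups=2):
    """Interleave the channels of the groups so later blocks see both branches."""
    c, h, w = x.shape
    return x.reshape(groups, c // groups, h, w).transpose(1, 0, 2, 3).reshape(c, h, w)


def plain_block(x, branch):
    """Assumed structure: identity on one half, `branch` on the other, then concat and shuffle."""
    left, right = channel_split(x)
    right = branch(right)
    return channel_shuffle(np.concatenate([left, right], axis=0))


# Four 1x1 channels holding the values 0, 1, 2, 3 make the reordering visible.
x = np.arange(4, dtype=float).reshape(4, 1, 1)
out = plain_block(x, branch=lambda t: t)        # identity branch, for illustration only
order = out[:, 0, 0].tolist()
```

With two groups, channels [0, 1, 2, 3] come out as [0, 2, 1, 3], so each half of the next split contains channels from both branches.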
Our CNN Model
Layer     Output size  Kernel size  Stride  #Output channel
Image     224          -            -       3
PWConv    224          1            2       24
Norm      224          1            1       24
Shift     224          1            1       24
Pool      112          3            1       24
PWConv    112          2            2       24
Norm      112          1            1       24
ReLU      112          1            1       24
Shift     112          3            1       24
Pool      56           2            2       24
Stage 2   28           -            -       116   (4 repeats)
Stage 3   14           -            -       232   (8 repeats)
Stage 4   7            -            -       464   (16 repeats)
GAP       1            7            1       464
PWConv    1            1            1       1000
• Training-aware quantization
• w: binary, a: 8-bit
• 2.54 M params, 0.616 GMACs
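The "w: binary, a: 8-bit" setting can be illustrated at inference time with the sketch below. It assumes deterministic sign binarization in the style of BinaryConnect and a single hypothetical activation scale; the training-aware procedure itself is not shown, and the function names are chosen for this example.

```python
import numpy as np


def binarize(w):
    """Deterministic binarization: +1 for w >= 0, otherwise -1."""
    return np.where(w >= 0, 1, -1).astype(np.int8)


def quantize_act(a, scale):
    """Quantize non-negative activations to unsigned 8-bit integers (scale is an assumed parameter)."""
    return np.clip(np.round(a / scale), 0, 255).astype(np.uint8)


def binary_pointwise(a_q, w_b):
    """1x1 convolution with +/-1 weights: every output is a signed sum of input channels."""
    return np.einsum("oc,chw->ohw", w_b.astype(np.int32), a_q.astype(np.int32))


w_b = binarize(np.array([[0.3, -0.2]]))                    # one output, two input channels
a_q = quantize_act(np.array([[[1.0]], [[0.3]]]), scale=0.1)
y = binary_pointwise(a_q, w_b)
```

With ±1 weights the multiply in each MAC reduces to an add or a subtract, which is what makes the point-wise layers cheap in FPGA logic.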
Fully-pipelined CNN Architecture

Dataflow for a Residual Stage of a Plain Block
• Double buffers for branch-flow
• Xilinx #pragma HLS dataflow
[Figure: layer units connected through feature-map (F.map) buffers, followed by the shuffle]
2D Convolutional Unit
[Figure: convolution unit with weight memory (W.mem), adder tree, batch normalization (BN), and activation (Act); dimension labels c, n, p, and n×p]
Pooling Units
[Figure: Max. pooling unit: the feature map memory (n=5, k=2; pixels x00 to x44) streams into a shift register (x11 x10 x04 x03 x02 x01 x00) under write control logic, and a max selector picks the window maximum.
Global ave. pooling unit: an adder and a register with reset accumulate the feature map, scaled by 1/n², under a controller and write control logic.]
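A software model of the two pooling units, assuming pixels arrive one per step in raster order. The shift register length (k-1)·n + k matches the seven entries shown for n=5, k=2; the stride handling and the function names are assumptions for this sketch.

```python
from collections import deque


def stream_max_pool(pixels, n, k=2, stride=1):
    """Max pooling over an n x n map streamed in raster order."""
    reg = deque(maxlen=(k - 1) * n + k)         # shift register: oldest entry drops out
    out = []
    for idx, p in enumerate(pixels):
        reg.append(p)
        row, col = divmod(idx, n)
        if row < k - 1 or col < k - 1:
            continue                            # window not yet fully inside the map
        if (row - (k - 1)) % stride or (col - (k - 1)) % stride:
            continue                            # position skipped by the stride
        # Taps: the k newest pixels from each of the last k rows.
        window = [reg[-1 - dr * n - dc] for dr in range(k) for dc in range(k)]
        out.append(max(window))                 # max selector
    return out


def stream_global_avg_pool(pixels, n):
    """Accumulate all n*n pixels, then scale by 1/n^2."""
    acc = 0
    for p in pixels:
        acc += p
    return acc / (n * n)


fmap = [1, 2, 3,
        4, 5, 6,
        7, 8, 9]
maxima = stream_max_pool(fmap, n=3, k=2)
mean = stream_global_avg_pool(fmap, n=3)
```

Only (k-1)·n + k pixels are held at any time, so the unit does not need to buffer the whole feature map.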
Experimental Results

Compression Ratio vs. Accuracy
• ImageNet2012 (224x224 pixel image) classification task
• PyTorch 1.4.0 + modified libjpeg library
[Figure: data transfer speed-up ratio (baseline: RGB image transfer) and ImageNet top-1 accuracy [%] vs. JPEG quantization bit]
JPEG quantization bit   q=1     q=2     q=3    q=4    q=5    Standard
Speed-up ratio          162.2   124.6   82.1   53.5   34.9   11.5
Top-1 accuracy [%]      59.61   66.64   70.8   71.1   71.2   71.2
Only decreases 0.3 point of accuracy and achieves 82.1 times speed-up
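The trade-off can be expressed as a small selection rule: take the setting with the largest speed-up whose accuracy stays within a tolerance of standard JPEG. The numbers below are the chart values as read from this slide; the 0.5-point tolerance and the function name are hypothetical.

```python
# (setting, speed-up ratio, top-1 accuracy [%]) as read from the chart.
points = [
    ("q=1", 162.2, 59.61),
    ("q=2", 124.6, 66.64),
    ("q=3", 82.1, 70.8),
    ("q=4", 53.5, 71.1),
    ("q=5", 34.9, 71.2),
    ("Standard", 11.5, 71.2),
]


def pick_setting(points, baseline_acc, max_drop):
    """Return the fastest setting whose accuracy is within max_drop points of the baseline."""
    acceptable = [p for p in points if baseline_acc - p[2] <= max_drop]
    return max(acceptable, key=lambda p: p[1])


best = pick_setting(points, baseline_acc=71.2, max_drop=0.5)   # tolerance is an assumed value
```

Tightening the tolerance moves the choice toward q=4 or q=5; loosening it trades accuracy for a higher transfer rate.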
Implementation Results
Module            #LUTs     #FFs      #DSPs    18Kb BRAMs  #URAMs
JPEG Decoder      11,675    6,646     34       2           0
Huffman Decoder   6,794     2,378     0        0           0
2D-IDCT           4,881     4,278     34       2           0
Pipelined-CNN     263,120   266,784   2,336    2,744       0
Total             274,795   273,440   2,370    2,746       16
(Ratio)           (23.2%)   (11.5%)   (34.6%)  (63.5%)     (1.6%)
• Xilinx Inc. Virtex UltraScale+ FPGA VCU1525 acceleration development kit
• Xilinx Inc. SDAccel 2018.2
• Operates at 300 MHz and 75 W
• System performance: 3321.25 FPS
• JPEG transfer and decode: 81,120 FPS (cf. conventional RGB transfer: 1242.8 FPS)
• The JPEG decoder uses only 4.2% of the LUTs in the total system
Comparison with Other FPGA Implementations
Method             AlexNet [1]  FINN-R [2]   Synetgy [3]  MobNetV2 [4]  Cloud-DNN [5]      Ours
FPGA               Stratix V    Zynq ZU3EG   Zynq ZU3EG   Zynq ZU9EG    Virtex US+ XCVU9P  Virtex US+ XCVU9P
FPS                864.7        200.0        96.5         809.8         123.1              3321.2
Top-1 Acc.         42.90%       50.30%       68.30%       68.1%         ---                70.8%
Top-5 Acc.         66.80%       ---          88.12%       ---           ---                90.1%
Precision (W/Act)  16/16        1/2          4/4          8/8           16/16              1/8
Freq. (MHz)        150          220          250          333           214                300
Power (W)          26.2         10.2         5.5          ---           49.25              75.0
[1] S. Liang, S. Yin, L. Liu, W. Luk, and S. Wei, “FP-BNN: Binarized neural network on FPGA,” Neurocomputing, 275:1072–1086, 2018.
[2] M. Blott, T. Preusser, N. Fraser, G. Gambardella, K. O’Brien, and Y. Umuroglu, “FINN-R: An end-to-end deep-learning framework for fast exploration of quantized neural networks,” 2018.
[3] Y. Yang, Q. Huang, B. Wu, T. Zhang, L. Ma, G. Gambardella, M. Blott, L. Lavagno, K. A. Vissers, J. Wawrzynek and K. Keutzer, “Synetgy: Algorithm-hardware Co-design for ConvNet Accelerators on Embedded FPGAs,” FPGA, 2019, pp. 23–32.
[4] D. Wu, Y. Zhang, X. Jia, L. Tian, T. L, L. Sui, D. Xie, and Y. Shan, “A High-performance CNN Processor Based on FPGA for MobileNets,” 29th International Conference on Field Programmable Logic and Applications (FPL), 2019, pp. 136–143.
[5] Y. Chen, J. He, X. Zhang, C. Hao, and D. Chen, “Cloud-DNN: An Open Framework for Mapping DNN Models to Cloud FPGAs,” FPGA, 2019, pp. 73–82.
Comparison with CPU and GPU
Platform            CPU            GPU          FPGA
Device              Xeon E5-2690   Tesla V100   Virtex US+ XCVU9P
Clock Freq.         2.6 GHz        1.53 GHz     0.3 GHz
Memory              32GB DDR4      16GB HBM2    9.49 MB BRAM
Throughput (FPS)    24.0           350.0        3321.25
Power (W)           95             295          75
Efficiency (FPS/W)  0.25           1.18         44.28
• Ubuntu 18.04 LTS with PyTorch 1.4.0
• 128 Batch with INT8 quantization (for CPU and GPU)
Note: CPU and GPU did not use our JPEG compression scheme
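The speed-up and energy-efficiency ratios follow directly from the table rows. A short check, using the throughput and efficiency values exactly as listed (the efficiency ratios are taken from the rounded FPS/W row, not recomputed from FPS and power):

```python
# Figures copied from the comparison table (throughput in FPS, efficiency in FPS/W).
throughput = {"CPU": 24.0, "GPU": 350.0, "FPGA": 3321.25}
efficiency = {"CPU": 0.25, "GPU": 1.18, "FPGA": 44.28}


def ratio_vs_fpga(table, platform):
    """How many times larger the FPGA figure is than the given platform's."""
    return table["FPGA"] / table[platform]


speedup = {p: round(ratio_vs_fpga(throughput, p), 1) for p in ("CPU", "GPU")}
eff_gain = {p: round(ratio_vs_fpga(efficiency, p), 1) for p in ("CPU", "GPU")}
print(speedup)    # {'CPU': 138.4, 'GPU': 9.5}
print(eff_gain)   # {'CPU': 177.1, 'GPU': 37.5}
```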

Conclusion
• Customized JPEG compression for a high-speed inference
• 82.1x speed-up, 0.3-point accuracy drop
• CNN model for a fully-pipelined implementation
• Channel shift and point-wise decomposition
• Binary weight quantization
• Channel split-shuffle operation
• Fully-pipelined CNN architecture
• Achieved 3,321 FPS@75W
• Speed-up: 138.4x CPU, 9.5x GPU
• Energy efficiency: 177.1x CPU, 37.5x GPU
• Future works
• Custom compression & Other DL applications
Thank you
Hiroki Nakahara (Tokyo Tech, JP)
nakahara@ict.e.titech.ac.jp
31

More Related Content

What's hot

Xilinx Cool Runner Architecture
Xilinx Cool Runner ArchitectureXilinx Cool Runner Architecture
Xilinx Cool Runner Architecture
dragonpradeep
 
Unit vi (2)
Unit vi (2)Unit vi (2)
Unit vi (2)
Siva Nageswararao
 
Analog to Digital converter in ARM
Analog to Digital converter in ARMAnalog to Digital converter in ARM
Analog to Digital converter in ARM
Aarav Soni
 
Design challenges in embedded systems
Design challenges in embedded systemsDesign challenges in embedded systems
Design challenges in embedded systems
mahalakshmimalini
 
Basic of AI Accelerator Design using Verilog HDL
Basic of AI Accelerator Design using Verilog HDLBasic of AI Accelerator Design using Verilog HDL
Basic of AI Accelerator Design using Verilog HDL
Joohan KIM
 
Arm cortex-m3 by-joe_bungo_arm
Arm cortex-m3 by-joe_bungo_armArm cortex-m3 by-joe_bungo_arm
Arm cortex-m3 by-joe_bungo_arm
Prashant Ahire
 
DESIGN AND SIMULATION OF DIFFERENT 8-BIT MULTIPLIERS USING VERILOG CODE BY SA...
DESIGN AND SIMULATION OF DIFFERENT 8-BIT MULTIPLIERS USING VERILOG CODE BY SA...DESIGN AND SIMULATION OF DIFFERENT 8-BIT MULTIPLIERS USING VERILOG CODE BY SA...
DESIGN AND SIMULATION OF DIFFERENT 8-BIT MULTIPLIERS USING VERILOG CODE BY SA...
Saikiran Panjala
 
5 g nr (new radio)overview
5 g nr (new radio)overview5 g nr (new radio)overview
5 g nr (new radio)overview
Braj Kishor
 
Serial Peripheral Interface(SPI)
Serial Peripheral Interface(SPI)Serial Peripheral Interface(SPI)
Serial Peripheral Interface(SPI)
Dhaval Kaneria
 
Placement and algorithm.
Placement and algorithm.Placement and algorithm.
Placement and algorithm.
Ashish Singh
 
VLSI Testing Techniques
VLSI Testing TechniquesVLSI Testing Techniques
VLSI Testing Techniques
A B Shinde
 
LTE-Advanced Physical Layer
LTE-Advanced Physical LayerLTE-Advanced Physical Layer
LTE-Advanced Physical Layer
Praveen Kumar
 
ARM CORTEX M3 PPT
ARM CORTEX M3 PPTARM CORTEX M3 PPT
ARM CORTEX M3 PPT
Gaurav Verma
 
C Programming For Embedded Systems
C Programming For Embedded SystemsC Programming For Embedded Systems
C Programming For Embedded Systems
Ganesh Samarthyam
 
8086
80868086
Instruction set of 8086
Instruction set of 8086Instruction set of 8086
Instruction set of 8086
9840596838
 
Interfacing of ADC 0808
Interfacing of ADC 0808Interfacing of ADC 0808
Interfacing of ADC 0808
Akkenaguntla Karthik
 
Concurrent programming with RTOS
Concurrent programming with RTOSConcurrent programming with RTOS
Concurrent programming with RTOS
Sirin Software
 
Turbo codes.ppt
Turbo codes.pptTurbo codes.ppt
Turbo codes.ppt
Prasant Barik
 
Low Power Design and Verification
Low Power Design and VerificationLow Power Design and Verification
Low Power Design and Verification
DVClub
 

What's hot (20)

Xilinx Cool Runner Architecture
Xilinx Cool Runner ArchitectureXilinx Cool Runner Architecture
Xilinx Cool Runner Architecture
 
Unit vi (2)
Unit vi (2)Unit vi (2)
Unit vi (2)
 
Analog to Digital converter in ARM
Analog to Digital converter in ARMAnalog to Digital converter in ARM
Analog to Digital converter in ARM
 
Design challenges in embedded systems
Design challenges in embedded systemsDesign challenges in embedded systems
Design challenges in embedded systems
 
Basic of AI Accelerator Design using Verilog HDL
Basic of AI Accelerator Design using Verilog HDLBasic of AI Accelerator Design using Verilog HDL
Basic of AI Accelerator Design using Verilog HDL
 
Arm cortex-m3 by-joe_bungo_arm
Arm cortex-m3 by-joe_bungo_armArm cortex-m3 by-joe_bungo_arm
Arm cortex-m3 by-joe_bungo_arm
 
DESIGN AND SIMULATION OF DIFFERENT 8-BIT MULTIPLIERS USING VERILOG CODE BY SA...
DESIGN AND SIMULATION OF DIFFERENT 8-BIT MULTIPLIERS USING VERILOG CODE BY SA...DESIGN AND SIMULATION OF DIFFERENT 8-BIT MULTIPLIERS USING VERILOG CODE BY SA...
DESIGN AND SIMULATION OF DIFFERENT 8-BIT MULTIPLIERS USING VERILOG CODE BY SA...
 
5 g nr (new radio)overview
5 g nr (new radio)overview5 g nr (new radio)overview
5 g nr (new radio)overview
 
Serial Peripheral Interface(SPI)
Serial Peripheral Interface(SPI)Serial Peripheral Interface(SPI)
Serial Peripheral Interface(SPI)
 
Placement and algorithm.
Placement and algorithm.Placement and algorithm.
Placement and algorithm.
 
VLSI Testing Techniques
VLSI Testing TechniquesVLSI Testing Techniques
VLSI Testing Techniques
 
LTE-Advanced Physical Layer
LTE-Advanced Physical LayerLTE-Advanced Physical Layer
LTE-Advanced Physical Layer
 
ARM CORTEX M3 PPT
ARM CORTEX M3 PPTARM CORTEX M3 PPT
ARM CORTEX M3 PPT
 
C Programming For Embedded Systems
C Programming For Embedded SystemsC Programming For Embedded Systems
C Programming For Embedded Systems
 
8086
80868086
8086
 
Instruction set of 8086
Instruction set of 8086Instruction set of 8086
Instruction set of 8086
 
Interfacing of ADC 0808
Interfacing of ADC 0808Interfacing of ADC 0808
Interfacing of ADC 0808
 
Concurrent programming with RTOS
Concurrent programming with RTOSConcurrent programming with RTOS
Concurrent programming with RTOS
 
Turbo codes.ppt
Turbo codes.pptTurbo codes.ppt
Turbo codes.ppt
 
Low Power Design and Verification
Low Power Design and VerificationLow Power Design and Verification
Low Power Design and Verification
 

Similar to FCCM2020: High-Throughput Convolutional Neural Network on an FPGA by Customized JPEG Compression

Deep Learning Initiative @ NECSTLab
Deep Learning Initiative @ NECSTLabDeep Learning Initiative @ NECSTLab
Deep Learning Initiative @ NECSTLab
NECST Lab @ Politecnico di Milano
 
Dp2 ppt by_bikramjit_chowdhury_final
Dp2 ppt by_bikramjit_chowdhury_finalDp2 ppt by_bikramjit_chowdhury_final
Dp2 ppt by_bikramjit_chowdhury_final
Bikramjit Chowdhury
 
Convolutional neural networks for speech controlled prosthetic hands
Convolutional neural networks for speech controlled prosthetic handsConvolutional neural networks for speech controlled prosthetic hands
Convolutional neural networks for speech controlled prosthetic hands
Mohsen Jafarzadeh
 
"An adaptive modular approach to the mining of sensor network ...
"An adaptive modular approach to the mining of sensor network ..."An adaptive modular approach to the mining of sensor network ...
"An adaptive modular approach to the mining of sensor network ...
butest
 
Cvpr 2018 papers review (efficient computing)
Cvpr 2018 papers review (efficient computing)Cvpr 2018 papers review (efficient computing)
Cvpr 2018 papers review (efficient computing)
DonghyunKang12
 
Science and Cyberinfrastructure in the Data-Dominated Era
Science and Cyberinfrastructure in the Data-Dominated EraScience and Cyberinfrastructure in the Data-Dominated Era
Science and Cyberinfrastructure in the Data-Dominated Era
Larry Smarr
 
Revisiting Sensor MAC for Periodic Monitoring: Why Should Transmitters Be Ear...
Revisiting Sensor MAC for Periodic Monitoring: Why Should Transmitters Be Ear...Revisiting Sensor MAC for Periodic Monitoring: Why Should Transmitters Be Ear...
Revisiting Sensor MAC for Periodic Monitoring: Why Should Transmitters Be Ear...
deawoo Kim
 
EIS_REVIEW_1.pptx
FCCM2020: High-Throughput Convolutional Neural Network on an FPGA by Customized JPEG Compression

  • 1. High-Throughput Convolutional Neural Network on an FPGA by Customized JPEG Compression
    Hiroki Nakahara (Tokyo Institute of Technology, JP)
    Zhiqiang Que, Wayne Luk (Imperial College London, UK)
  • 2. Outline
    • Background
    • JPEG compression for high-speed inference
    • CNN model for an FPGA implementation
    • Channel shift and point-wise decomposition
    • Quantization strategy
    • Channel shuffle
    • Fully-pipelined CNN architecture
    • Experimental results
    • Conclusion
  • 3. Convolutional Neural Networks (CNNs)
    • High accuracy and many applications
    • Image recognition, NLP, data mining [1]
    • FPGAs on cloud services
    • Amazon AWS, Microsoft Azure, etc.
    [1] Y. Liang, K. Ouyang, L. Jing, S. Ruan, Y. Liu, J. Zhang, D. S. Rosenblum and Y. Zheng, “UrbanFM: Inferring Fine-Grained Urban Flows,” ACM SIGKDD Conf. on Knowledge Discovery and Data Mining (KDD), 2019, pp.3132–3142.
  • 4. Problems
    • Power consumption
    • Performance bottleneck (data transfer)
    • e.g., AWS F1 provides overall read/write at 6.5 GB/s from host CPU to FPGA [1]
    [Figure labels: Host PC, .jpg, RAW (RGB) image, PCIe, Interconnect, CNN Kernel, Accelerator Card]
    [1] Y. Chen, J. He, X. Zhang, C. Hao, and D. Chen, “Cloud-DNN: An Open Framework for Mapping DNN Models to Cloud FPGAs,” FPGA, 2019, pp.73–82.
  • 5. Our Contributions
    • Customized JPEG for high-speed data transfer → Compression ratio (speed-up) vs. accuracy
    • Fully pipelined inference architecture with a light-weight CNN
    [Figure: (a) Conventional: Host PC, .jpg, RAW (RGB) image, PCIe, Interconnect, CNN Kernel on the FPGA. (b) Proposed: Host PC, .jpg, low-quality image, PCIe, Interconnect, Decoder, CNN Kernel on the FPGA.]
  • 10. Customized JPEG for a High-speed Inference
  • 12. Proposed JPEG Coding with a CNN Accelerator
    [Block diagram labels: Host PC, JPEG image (.jpg), Huffman Decoding, Reverse Quant., Extreme Quant. (Value q), Quant. Table, Huffman Encoding, Huffman Coding Tables, Image Stream Data, PCIe, FPGA, Huffman Decoding & Reverse Quant., Ping-pong Buffer (RAMs), IDCT, Fully Pipelining CNN, Detection Result]
  • 13. Huffman Decoding and Reverse Quantization Unit
    [Block diagram labels: Image Data Stream, Shift Register, Priority Encoder (patterns 00**, 01**, 10**, 110*, 1110), lookup 0 1 2 3 4 → Shift Value 2 2 2 3 4, Run-length Decoder, Quantized Value, Quant. Value q, Zig-zag pattern ROM, Zig-zag writing (ADR, WDATA), Buffer RAM]
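The decoding path named on this slide (prefix matching, shifting by the matched code length, then zig-zag writing into a buffer) can be sketched in software as follows. This is only an illustrative sketch: the five-entry code table mirrors the bit patterns on the slide, but the symbol meanings, the `(run, value)` pair format, and the multiplicative `step` used for reverse quantization are assumptions, not details given in the slides.

```python
# Hypothetical prefix table mirroring the slide's patterns 00**, 01**, 10**, 110*, 1110.
# Each entry is (bit pattern, symbol index); the code length is the shift value.
CODE_TABLE = [("00", 0), ("01", 1), ("10", 2), ("110", 3), ("1110", 4)]


def decode_symbols(bits):
    """Decode a bit string by matching a prefix, then shifting by its code length."""
    symbols, pos = [], 0
    while pos < len(bits):
        for pattern, symbol in CODE_TABLE:
            if bits.startswith(pattern, pos):
                symbols.append(symbol)
                pos += len(pattern)  # shift value = length of the matched code
                break
        else:
            raise ValueError(f"no code matches at bit {pos}")
    return symbols


def zigzag_order(n=8):
    """Return (row, col) pairs of an n x n block in JPEG zig-zag scan order."""
    order = []
    for s in range(2 * n - 1):  # s = row + col selects one anti-diagonal
        rows = range(max(0, s - n + 1), min(s, n - 1) + 1)
        # Odd diagonals are walked with the row increasing, even ones with it decreasing.
        for r in (rows if s % 2 else reversed(rows)):
            order.append((r, s - r))
    return order


def write_block(pairs, step, n=8):
    """Run-length decode (zero_run, value) pairs and write them in zig-zag order.

    `step` is an assumed scalar dequantization factor; the slide only names
    a quantization value q without giving the exact arithmetic.
    """
    block = [[0] * n for _ in range(n)]
    order = zigzag_order(n)
    pos = 0
    for zero_run, value in pairs:
        pos += zero_run               # skip the zero coefficients
        r, c = order[pos]
        block[r][c] = value * step    # reverse quantization (assumed form)
        pos += 1
    return block
```

How decoded symbols map to run/value pairs is not stated in the slides, so `write_block` takes the pairs directly rather than chaining the two stages.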
  • 14. 2D-IDCT
    • Decompose the 2D-IDCT into 16 1D-IDCTs
    AP-922: Application Note 922, “A Fast Precise Implementation of 8x8 Discrete Cosine Transform Using the Streaming SIMD Extensions and MMX Instructions,” https://www.cs.cmu.edu/~barbic/cs-740/ap922.pdf
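The decomposition on this slide relies on the 2D transform being separable: an 8x8 block needs 8 row passes and 8 column passes, 16 one-dimensional transforms in total. The following is a minimal floating-point sketch, assuming the orthonormal 8-point DCT basis; the function names are illustrative and fixed-point hardware arithmetic is not modeled.

```python
import math

N = 8  # JPEG block size


def idct_1d(coeffs):
    """Orthonormal 1D inverse DCT of a length-N sequence."""
    out = []
    for x in range(N):
        total = 0.0
        for u in range(N):
            scale = math.sqrt(1.0 / N) if u == 0 else math.sqrt(2.0 / N)
            total += scale * coeffs[u] * math.cos((2 * x + 1) * u * math.pi / (2 * N))
        out.append(total)
    return out


def idct_2d(block):
    """2D inverse DCT built from 16 one-dimensional passes (8 rows, then 8 columns)."""
    rows = [idct_1d(row) for row in block]              # 8 row passes
    cols = [idct_1d(list(col)) for col in zip(*rows)]   # 8 column passes
    return [list(row) for row in zip(*cols)]            # transpose back to row-major


# A block holding only a DC coefficient decodes to a constant image.
dc_only = [[0.0] * N for _ in range(N)]
dc_only[0][0] = 8.0
pixels = idct_2d(dc_only)
print(round(pixels[0][0], 6), round(pixels[7][7], 6))
```

Because each pass is independent of the others in the same direction, the same small 1D unit can be reused for all 16 passes.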
  • 15. 2D-IDCT Unit
    [Block diagram labels: Controller, Operation Units, Reg., 1D-IDCT Unit, RAMs]
    • Two 1D-IDCT units
    • Use half precision (16 bits)
  • 16. CNN model for an FPGA Implementation
  • 17. Overview
    1. Decomposing k×k convolution by channel shift [1] and point-wise (1×1) convolution
    2. Binary (1-bit) weight quantization [2]
    3. Channel split and shuffle [3]
    [1] B. Wu, A. Wan, X. Yue, P. H. Jin, S. Zhao, N. Golmant, A. Gholaminejad, J. Gonzalez, and K. Keutzer, “Shift: A Zero FLOP, Zero Parameter Alternative to Spatial Convolutions,” CVPR, 2018, pp. 9127-9135.
    [2] M. Courbariaux, Y. Bengio, and J.-P. David, “BinaryConnect: Training Deep Neural Networks with Binary Weights During Propagations,” NIPS, 2015, pp.3105–3113.
    [3] X. Zhang, X. Zhou, M. Lin and J. Sun, “ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices,” CVPR, 2018.
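Items 1 and 2 of this overview can be illustrated with a small pure-Python sketch: each channel is shifted spatially, then a 1×1 convolution with sign-only weights mixes the channels. The cyclic assignment of the nine 3×3 offsets to channels, the zero padding, and all function names are assumptions made for illustration; the slides do not specify them.

```python
def shift_channel(fmap, dy, dx):
    """Shift one H x W channel by (dy, dx); vacated positions are filled with zeros."""
    h, w = len(fmap), len(fmap[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sy, sx = y - dy, x - dx
            if 0 <= sy < h and 0 <= sx < w:
                out[y][x] = fmap[sy][sx]
    return out


def binarize(weights):
    """Binary (1-bit) weights: keep only the sign of each real-valued weight."""
    return [[1 if wk >= 0 else -1 for wk in row] for row in weights]


def pointwise_conv(channels, weights):
    """1x1 convolution: every output pixel is a weighted sum over input channels."""
    h, w = len(channels[0]), len(channels[0][0])
    return [[[sum(wk * ch[y][x] for wk, ch in zip(w_row, channels))
              for x in range(w)] for y in range(h)]
            for w_row in weights]


def shift_pwconv(channels, weights):
    """Stand-in for a 3x3 convolution: per-channel shift, then binary 1x1 conv."""
    # Assumed policy: the 9 offsets of a 3x3 window are assigned to channels cyclically.
    offsets = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    shifted = [shift_channel(ch, *offsets[i % len(offsets)])
               for i, ch in enumerate(channels)]
    return pointwise_conv(shifted, binarize(weights))
```

With ±1 weights the multiplications reduce to additions and subtractions, which is why this combination maps well onto an adder tree.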
  • 18. Building Blocks
    [Figure: (a) Plain block with labels Channel Split, Shift, PWConv, Shift, PWConv, Concat & Shuffle. (b) Down-sampling block with labels Channel Split, Shift, PWConv (s=2), Shift, PWConv, Concat & Shuffle, Shift, PWConv (s=2). Annotations: #channel x2, #channel/2.]
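The split, concat, and shuffle steps named in the plain block can be sketched on a plain list of channels. This sketch assumes a ShuffleNet-style layout in which one half of the channels bypasses the branch and the other half is processed; the exact wiring of the figure is not recoverable from the transcript, and `branch` is a hypothetical stand-in for the Shift/PWConv layers.

```python
def channel_split(channels):
    """Split the channel list into two halves."""
    half = len(channels) // 2
    return channels[:half], channels[half:]


def channel_shuffle(channels, groups=2):
    """Interleave channels across groups: view as (groups, n), transpose, flatten."""
    n = len(channels) // groups
    return [channels[g * n + i] for i in range(n) for g in range(groups)]


def plain_block(channels, branch):
    """Assumed plain block: identity on one half, `branch` on the other, then shuffle."""
    left, right = channel_split(channels)
    return channel_shuffle(left + branch(right))


# Channels are represented by labels here so the reordering is easy to see.
print(channel_shuffle(["a", "b", "c", "d", "e", "f"]))
```

The shuffle is a fixed permutation, so in hardware it costs only wiring or address generation rather than arithmetic.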
  • 19. Our CNN Model
    Layer | Output size | Kernel size | Stride | #Output channel
    Image | 224 | - | - | 3
    PWConv | 224 | 1 | 2 | 24
    Norm | 224 | 1 | 1 | 24
    Shift | 224 | 1 | 1 | 24
    Pool | 112 | 3 | 1 | 24
    PWConv | 112 | 2 | 2 | 24
    Norm | 112 | 1 | 1 | 24
    ReLU | 112 | 1 | 1 | 24
    Shift | 112 | 3 | 1 | 24
    Pool | 56 | 2 | 2 | 24
    Stage 2 | 28 | - | - | 116 (4 repeats)
    Stage 3 | 14 | - | - | 232 (8 repeats)
    Stage 4 | 7 | - | - | 464 (16 repeats)
    GAP | 1 | 7 | 1 | 464
    PWConv | 1 | 1 | 1 | 1000
    • Training-aware quantization
    • w: binary, a: 8-bit
    • 2.54 M params, 0.616 GMACs
  • 21. Dataflow for a Residual Stage of a Plain Block
    • Double buffers for branch-flow
    • Xilinx #pragma HLS dataflow
    [Block diagram labels: Layer Unit, F.map Buffer, Layer Unit, Shuffle]
  • 22. 2D Convolutional Unit
    [Block diagram labels: W.mem, Convolution Unit, Adder Tree, BN, Act; dimensions c, n, p, n×p]
  • 23. Pooling Units
    [Figure: Max. Pooling Unit for a 5×5 feature map x00 ... x44 (n=5, k=2), with F. Map Mem., Write Ctrl. Logic, a Shift Register holding x11 x10 x04 x03 x02 x01 x00, and a Max Selector. Global Ave. Pooling Unit with F. Map Mem., Write Ctrl. Logic, an adder (+), Register, Reset, Controller, and a 1/n² scaling.]
  • 25. Compression Ratio vs. Accuracy
    JPEG quantization bit | q=1 | q=2 | q=3 | q=4 | q=5 | Standard
    Data-transfer speed-up ratio (baseline: RGB image transfer) | 162.2 | 124.6 | 82.1 | 53.5 | 34.9 | 11.5
    ImageNet Top-1 accuracy [%] | 59.61 | 66.64 | 70.8 | 71.1 | 71.2 | 71.2
    • ImageNet2012 (224x224 pixel image) classification task
    • PyTorch 1.4.0 + modified libjpeg library
    • Accuracy decreases by only 0.3 points while achieving an 82.1x speed-up
  • 26. Implementation Results
    Module | #LUTs | #FFs | #DSPs | 18Kb BRAMs | #URAMs
    JPEG Decoder | 11,675 | 6,646 | 34 | 2 | 0
    Huffman Decoder | 6,794 | 2,378 | 0 | 0 | 0
    2D-IDCT | 4,881 | 4,278 | 34 | 2 | 0
    Pipelined-CNN | 263,120 | 266,784 | 2,336 | 2,744 | 0
    Total | 274,795 | 273,440 | 2,370 | 2,746 | 16
    (Ratio) | (23.2%) | (11.5%) | (34.6%) | (63.5%) | (1.6%)
    • Xilinx Inc. Virtex UltraScale+ FPGA VCU1525 acceleration development kit
    • Xilinx Inc. SDAccel 2018.2
    • Operates at 300 MHz and 75 W
    • System performance: 3321.25 FPS
    • JPEG trans-decode: 81,120 FPS (cf. conventional RGB transfer: 1242.8 FPS)
    • The JPEG decoder uses only 4.2% of the system's total LUTs
  • 27. Comparison with Other FPGA Implementations
    Method | AlexNet [1] | FINN-R [2] | Synetgy [3] | MobNetV2 [4] | CloudDNN [5] | Ours
    FPGA | Stratix V | Zynq ZU3EG | Zynq ZU3EG | Zynq ZU9EG | Virtex US+ XCVU9P | Virtex US+ XCVU9P
    FPS | 864.7 | 200.0 | 96.5 | 809.8 | 123.1 | 3321.2
    Top-1 Acc. | 42.90% | 50.30% | 68.30% | 68.1% | --- | 70.8%
    Top-5 Acc. | 66.80% | --- | 88.12% | --- | --- | 90.1%
    Precision (W/Act) | 16/16 | 1/2 | 4/4 | 8/8 | 16/16 | 1/8
    Freq. (MHz) | 150 | 220 | 250 | 333 | 214 | 300
    Power (W) | 26.2 | 10.2 | 5.5 | --- | 49.25 | 75.0
    [1] S. Liang, S. Yin, L. Liu, W. Luk, and S. Wei, “FP-BNN: Binarized neural network on FPGA,” Neurocomputing, 275:1072–1086, 2018.
    [2] M. Blott, T. Preusser, N. Fraser, G. Gambardella, K. O’Brien, and Y. Umuroglu, “FINN-R: An end-to-end deep-learning framework for fast exploration of quantized neural networks,” 2018.
    [3] Y. Yang, Q. Huang, B. Wu, T. Zhang, L. Ma, G. Gambardella, M. Blott, L. Lavagno, K. A. Vissers, J. Wawrzynek and K. Keutzer, “Synetgy: Algorithm-hardware Co-design for ConvNet Accelerators on Embedded FPGAs,” FPGA, pp. 23-32, 2019.
    [4] D. Wu, Y. Zhang, X. Jia, L. Tian, T. L, L. Sui, D. Xie, and Y. Shan, “A High-performance CNN Processor Based on FPGA for MobileNets,” 29th International Conference on Field Programmable Logic and Applications (FPL), 2019, pp.136-143.
    [5] Y. Chen, J. He, X. Zhang, C. Hao, and D. Chen, “Cloud-DNN: An Open Framework for Mapping DNN Models to Cloud FPGAs,” FPGA, 2019, pp.73–82.
  • 28. Comparison with CPU and GPU
    Platform | CPU | GPU | FPGA
    Device | Xeon E5-2690 | Tesla V100 | Virtex US+ XCVU9P
    Clock Freq. | 2.6 GHz | 1.53 GHz | 0.3 GHz
    Memory | 32GB DDR4 | 16GB HBM2 | 9.49 MB BRAM
    Throughput (FPS) | 24.0 | 350.0 | 3321.25
    Power (W) | 95 | 295 | 75
    Efficiency (FPS/W) | 0.25 | 1.18 | 44.28
    • Ubuntu 18.04 LTS with PyTorch 1.4.0
    • Batch size 128 with INT8 quantization (for CPU and GPU)
    Note: CPU and GPU did not use our JPEG compression scheme
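The efficiency row and the relative throughput of the three platforms follow directly from the throughput and power figures on this slide. The short sketch below recomputes them; the dictionary layout and function names are illustrative only.

```python
# Throughput (FPS) and power (W) as listed on the slide.
platforms = {
    "CPU": {"fps": 24.0, "watt": 95.0},
    "GPU": {"fps": 350.0, "watt": 295.0},
    "FPGA": {"fps": 3321.25, "watt": 75.0},
}


def efficiency(name):
    """Frames per second per watt for one platform."""
    p = platforms[name]
    return p["fps"] / p["watt"]


def speedup(name, baseline):
    """Throughput of `name` relative to `baseline`."""
    return platforms[name]["fps"] / platforms[baseline]["fps"]


for name in platforms:
    print(f"{name}: {efficiency(name):.2f} FPS/W")
print(f"FPGA vs CPU: {speedup('FPGA', 'CPU'):.1f}x throughput")
print(f"FPGA vs GPU: {speedup('FPGA', 'GPU'):.1f}x throughput")
```

The unrounded GPU quotient is about 1.186 FPS/W, so the last digit depends on whether the value is rounded or truncated.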
  • 30. Conclusion
    • Customized JPEG compression for high-speed inference
      • 82.1x speed-up, 0.3-point accuracy drop
    • CNN model for a fully-pipelined implementation
      • Channel shift and point-wise decomposition
      • Binary weight quantization
      • Channel split-shuffle operation
    • Fully-pipelined CNN architecture
      • Achieved 3,321 FPS at 75 W
      • Speed-up: 138.4x CPU, 9.5x GPU
      • Energy efficiency: 177.1x CPU, 37.5x GPU
    • Future work
      • Custom compression and other DL applications
  • 31. Thank you
    Hiroki Nakahara (Tokyo Tech, JP) nakahara@ict.e.titech.ac.jp