Short Paper

High-Throughput Edge Inference for BERT Models via Neural Architecture Search and Pipeline

Published: 05 June 2023
Abstract

There has been growing interest in improving BERT inference throughput on resource-constrained edge devices to deliver a satisfactory user experience. One methodology is heterogeneous computing, which uses multiple processing elements to accelerate inference. Another is Neural Architecture Search (NAS), which finds optimal solutions in the accuracy-throughput design space. In this paper, for the first time, we incorporate NAS with pipelining for BERT models. We show that performing NAS with pipelining achieves, on average, 53% higher throughput than NAS with a homogeneous system.
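To make the pipelining argument concrete, the following is a minimal sketch (not taken from the paper) of why partitioning encoder layers across two heterogeneous clusters raises throughput: in steady state the pipeline is limited by its slowest stage, so the split point is chosen to balance the stages. The cluster names, layer counts, and latency values below are hypothetical placeholders.

```python
# Hypothetical sketch: steady-state throughput of a layer-wise BERT pipeline
# across two heterogeneous CPU clusters. The latency numbers are illustrative
# placeholders, not measurements from the paper.

def pipeline_throughput(latencies, split):
    """Throughput (inferences/s) when encoder layers [0:split] run on the big
    cluster and layers [split:] run on the LITTLE cluster; in steady state the
    pipeline is bounded by its slowest stage."""
    stage_big = sum(latencies["big"][:split])        # seconds per inference, stage 1
    stage_little = sum(latencies["little"][split:])  # seconds per inference, stage 2
    return 1.0 / max(stage_big, stage_little)

# 12 encoder layers with made-up per-layer latencies (seconds).
latencies = {
    "big":    [0.010] * 12,  # faster cluster
    "little": [0.018] * 12,  # slower cluster
}

# Choose the split point that maximizes pipelined throughput.
best_split = max(range(1, 12), key=lambda s: pipeline_throughput(latencies, s))
print(best_split, round(pipeline_throughput(latencies, best_split), 1))
```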
Additionally, we propose a NAS algorithm that incorporates hardware performance feedback to accelerate the search. The proposed algorithm speeds up the search process by approximately 4x and 5.5x on the BERT and CNN design spaces, respectively. Finally, by exploring the accuracy-throughput design space of BERT models, we demonstrate that performing pipelining and then NAS (Pipeline-then-NAS) can yield solutions with up to 9x higher inference throughput than homogeneous inference on the BERT-base model, with only a 1.3% decrease in accuracy.
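The abstract names hardware-performance feedback only at a high level; one plausible reading is a search loop in which measured throughput prunes candidates before the expensive accuracy evaluation. The sketch below illustrates that idea under assumptions of ours: the design space (depth and width), the evaluator functions, and all numbers are hypothetical stand-ins, not the authors' implementation.

```python
import random

# Hypothetical BERT design space: number of encoder layers and hidden width.
DEPTHS = [4, 6, 8, 12]
WIDTHS = [256, 384, 512, 768]

def measured_throughput(depth, width):
    """Stand-in for on-device, pipelined throughput feedback (inferences/s)."""
    return 2000.0 / (depth * width / 256.0)

def estimated_accuracy(depth, width):
    """Stand-in for an accuracy predictor or quick fine-tuning result."""
    return 0.80 + 0.01 * DEPTHS.index(depth) + 0.005 * WIDTHS.index(width)

def search(num_samples=50, min_throughput=60.0, seed=0):
    """Sample architectures and use throughput feedback to skip slow candidates
    before accuracy evaluation; return the most accurate architecture that
    meets the throughput target."""
    rng = random.Random(seed)
    best = None
    for _ in range(num_samples):
        depth, width = rng.choice(DEPTHS), rng.choice(WIDTHS)
        tput = measured_throughput(depth, width)
        if tput < min_throughput:
            continue  # hardware feedback prunes this candidate early
        acc = estimated_accuracy(depth, width)
        if best is None or acc > best["accuracy"]:
            best = {"depth": depth, "width": width,
                    "throughput": tput, "accuracy": acc}
    return best

print(search())
```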



        Published In

        GLSVLSI '23: Proceedings of the Great Lakes Symposium on VLSI 2023
        June 2023
        731 pages
ISBN: 9798400701252
DOI: 10.1145/3583781

        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Author Tags

1. ARM big.LITTLE
2. edge inference
3. NAS
        4. pipeline
        5. throughput

        Qualifiers

        • Short-paper

        Funding Sources

        • Huawei Technologies Canada Inc.
• Fonds de Recherche du Québec – Nature et Technologies (FRQNT) Postdoctoral Research Scholarship

        Conference

        GLSVLSI '23
        Sponsor:
        GLSVLSI '23: Great Lakes Symposium on VLSI 2023
        June 5 - 7, 2023
Knoxville, TN, USA

        Acceptance Rates

        Overall Acceptance Rate 312 of 1,156 submissions, 27%
