Short Paper

High-Throughput Edge Inference for BERT Models via Neural Architecture Search and Pipeline

Published: 05 June 2023
Abstract

There has been growing interest in improving BERT inference throughput on resource-constrained edge devices to deliver a satisfactory user experience. One methodology is heterogeneous computing, which uses multiple processing elements to accelerate inference. Another is Neural Architecture Search (NAS), which finds optimal solutions in the accuracy-throughput design space. In this paper, for the first time, we incorporate NAS with pipelining for BERT models. We show that performing NAS with pipelining achieves, on average, 53% higher throughput than NAS with a homogeneous system.
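To make the pipelining argument concrete, the following is a minimal sketch (not taken from the paper) of why partitioning encoder layers across two heterogeneous clusters raises throughput: in steady state the pipeline is limited by its slowest stage, so the split point is chosen to balance the stages. The cluster names, layer counts, and latency values below are hypothetical placeholders.

```python
# Hypothetical sketch: steady-state throughput of a layer-wise BERT pipeline
# across two heterogeneous CPU clusters. The latency numbers are illustrative
# placeholders, not measurements from the paper.

def pipeline_throughput(latencies, split):
    """Throughput (inferences/s) when encoder layers [0:split] run on the big
    cluster and layers [split:] run on the LITTLE cluster; in steady state the
    pipeline is bounded by its slowest stage."""
    stage_big = sum(latencies["big"][:split])        # seconds per inference, stage 1
    stage_little = sum(latencies["little"][split:])  # seconds per inference, stage 2
    return 1.0 / max(stage_big, stage_little)

# 12 encoder layers with made-up per-layer latencies (seconds).
latencies = {
    "big":    [0.010] * 12,  # faster cluster
    "little": [0.018] * 12,  # slower cluster
}

# Choose the split point that maximizes pipelined throughput.
best_split = max(range(1, 12), key=lambda s: pipeline_throughput(latencies, s))
print(best_split, round(pipeline_throughput(latencies, best_split), 1))
```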
Additionally, we propose a NAS algorithm that incorporates hardware performance feedback to accelerate the search. The proposed algorithm speeds up the search process by approximately 4x and 5.5x on the BERT and CNN design spaces, respectively. Finally, by exploring the accuracy-throughput design space of BERT models, we demonstrate that performing pipelining and then NAS (Pipeline-then-NAS) can yield solutions with up to 9x higher inference throughput than homogeneous inference on the BERT-base model, with only a 1.3% decrease in accuracy.
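The abstract names hardware-performance feedback only at a high level; one plausible reading is a search loop in which measured throughput prunes candidates before the expensive accuracy evaluation. The sketch below illustrates that idea under assumptions of ours: the design space (depth and width), the evaluator functions, and all numbers are hypothetical stand-ins, not the authors' implementation.

```python
import random

# Hypothetical BERT design space: number of encoder layers and hidden width.
DEPTHS = [4, 6, 8, 12]
WIDTHS = [256, 384, 512, 768]

def measured_throughput(depth, width):
    """Stand-in for on-device, pipelined throughput feedback (inferences/s)."""
    return 2000.0 / (depth * width / 256.0)

def estimated_accuracy(depth, width):
    """Stand-in for an accuracy predictor or quick fine-tuning result."""
    return 0.80 + 0.01 * DEPTHS.index(depth) + 0.005 * WIDTHS.index(width)

def search(num_samples=50, min_throughput=60.0, seed=0):
    """Sample architectures and use throughput feedback to skip slow candidates
    before accuracy evaluation; return the most accurate architecture that
    meets the throughput target."""
    rng = random.Random(seed)
    best = None
    for _ in range(num_samples):
        depth, width = rng.choice(DEPTHS), rng.choice(WIDTHS)
        tput = measured_throughput(depth, width)
        if tput < min_throughput:
            continue  # hardware feedback prunes this candidate early
        acc = estimated_accuracy(depth, width)
        if best is None or acc > best["accuracy"]:
            best = {"depth": depth, "width": width,
                    "throughput": tput, "accuracy": acc}
    return best

print(search())
```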



        Published In

        GLSVLSI '23: Proceedings of the Great Lakes Symposium on VLSI 2023
        June 2023
        731 pages
ISBN: 9798400701252
DOI: 10.1145/3583781

        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Author Tags

1. ARM big.LITTLE
2. edge inference
3. NAS
        4. pipeline
        5. throughput

        Qualifiers

        • Short-paper

        Funding Sources

        • Huawei Technologies Canada Inc.
• Fonds de Recherche du Québec – Nature et Technologies (FRQNT) Postdoctoral Research Scholarship

        Conference

        GLSVLSI '23
        Sponsor:
        GLSVLSI '23: Great Lakes Symposium on VLSI 2023
        June 5 - 7, 2023
Knoxville, TN, USA

        Acceptance Rates

        Overall Acceptance Rate 312 of 1,156 submissions, 27%
