ARCO:Adaptive Multi-Agent Reinforcement Learning-Based Hardware/Software Co-Optimization Compiler for Improved Performance in DNN Accelerator Design

Arya Fayyazi University of Southern CaliforniaLos AngelesCAUSA afayyazi@usc.edu Mehdi Kamal University of Southern CaliforniaLos AngelesCAUSA mehdi.kamal@usc.edu  and  Massoud Pedram University of Southern CaliforniaLos AngelesCAUSA pedram@usc.edu
Abstract.

This paper presents ARCO, an adaptive Multi-Agent Reinforcement Learning (MARL)-based co-optimizing compilation framework designed to enhance the efficiency of mapping machine learning (ML) models - such as Deep Neural Networks (DNNs) - onto diverse hardware platforms. The framework incorporates three specialized actor-critic agents within MARL, each dedicated to a distinct aspect of compilation/optimization at an abstract level: one agent focuses on hardware, while two agents focus on software optimizations. This integration results in a collaborative hardware/software co-optimization strategy that improves the precision and speed of DNN deployments. Concentrating on high-confidence configurations simplifies the search space and delivers superior performance compared to current optimization methods. The ARCO framework surpasses existing leading frameworks, achieving a throughput increase of up to 37.95% while reducing the optimization time by up to 42.2% across various DNNs.

copyright: none

1. Introduction

The computational demands and complexity of deep neural networks (DNNs) have surged rapidly with their expanding applications across diverse industry sectors, including autonomous driving, medical imaging, and natural language processing (Hossain et al., 2023). These applications require high accuracy and real-time responsiveness, placing significant pressure on the computational infrastructure. Historically, performance optimization in environments like TensorFlow (Abadi et al., 2016) and PyTorch (Paszke et al., 2019) has relied on hand-optimized kernels such as NVIDIA’s cuDNN and Intel’s MKL. While these kernels are designed to optimize efficiency for particular hardware setups, they face challenges in adapting to the changing demands of modern neural architectures and their evolving computational requirements. The extensive number of parameters that need tuning and a huge hardware configuration search space pose significant hurdles (Ahn et al., 2020).

To address this growing complexity, there has been a shift towards automated compilation frameworks like TVM (Chen et al., 2018), TensorComprehensions (Vasilache et al., 2018), and AutoTVM (Chen et al., 2018). These frameworks move away from static, hand-tuned optimization strategies to dynamic, algorithm-driven approaches capable of adapting to various architectures and operational requirements. For example, AutoTVM utilizes a sophisticated approach involving boosted trees (Chen and Guestrin, 2016) to navigate the extensive configuration space of neural network code efficiently. This method predicts the performance of potential configurations and minimizes the need for exhaustive empirical testing, thereby expediting the optimization process. However, despite these technological advances, optimizing complex models requires substantial resources and compilation time, underscoring the ongoing challenges and gaps in current optimization methodologies. In addition to evolutionary strategies like genetic algorithms (Mirjalili and Mirjalili, 2019), machine-learning techniques have been instrumental in refining the search for optimal configurations across this complex landscape. When considering various machine learning algorithms, reinforcement learning (RL) stands out as a promising option for exploring design search spaces and optimizing tasks (Zhang et al., 2022; Wang et al., 2022; Bakshi and Johnsson, 2023; Ahn et al., 2020). RL operates by testing various strategies and utilizing feedback on their actions to maximize cumulative rewards. This learning process occurs through agents interacting with their environment, which can range from single to multi-agent setups. Multi-agent RL offers the advantage of enabling transfer learning and providing valuable insights into interconnected decision-making systems (Du and Ding, 2021).

To address these inefficiencies and push the boundaries of current compilation technologies, we propose an adaptive multi-agent reinforcement learning-based hardware/software co-optimization compiler called ARCO. At its core, this optimizing compiler relies on a multi-agent reinforcement learning (MARL) algorithm to optimize the mapping of the given model and the accelerator structure concurrently. ARCO utilizes three specialized agents—each designed to optimize different facets of the system architecture: two focus on software and DNN configurations, while the third optimizes the accelerator architecture. This actor-critic multi-agent approach facilitates a more integrated and holistic optimization process and allows for more granular control and customization of the optimization strategies employed.

The proposed framework efficiently narrows the vast search space by employing a soft threshold for selecting high-confidence configurations, focusing only on the most promising candidates. This methodology not only reduces computational overhead but also minimizes exploration time. Our method leverages the capabilities of VTA++ (Banerjee et al., 2021), a highly configurable deep learning accelerator, to investigate diverse architectural optimizations. ARCO utilizes Centralized Training with a Decentralized Execution (CTDE) strategy, wherein training occurs in a centralized manner. In contrast, execution is decentralized, enabling adaptive adjustments to the changing requirements of DNN workloads. This strategic approach efficiently narrows the extensive search space by utilizing a soft threshold to select high-confidence configurations, prioritizing the most promising candidates. The efficacy of the ARCO is assessed under a set of DNN models with a variety of architectures and compared with leading state-of-the-art approaches, and it has demonstrated substantial improvements in throughput.

The rest of the paper is organized as follows. Section 2 provides detailed background on the VTA++, CTDE in MARL, workflow for DNN compilers, and related works that have influenced our approach. ARCO and its detailed architecture and components are discussed in Section 3. In Section 4, the efficacy of the ARCO is assessed and compared to that of the established frameworks such as AutoTVM and CHAMELEON (Ahn et al., 2020). Finally, the paper concludes in Section 5.

2. Background

Refer to caption
Figure 1. High-Level architecture of VTA++.

This section briefly introduces the target hardware we used (i.e., VTA++), the Centralized Training and Decentralized Execution in Multi-Agent Reinforcement Learning, the workflow for DNN compilers, and related prior work.

2.1. VTA++

The Versatile Tensor Accelerator (VTA) (Moreau et al., 2019) has emerged as a highly effective deep learning hardware acceleration framework integrated within the TVM stack. To achieve a good hardware/software co-design, we need highly configurable target hardware, which we could optimize its architecture to achieve simultaneous hardware and software optimization. Therefore, we chose VTA++ (Banerjee et al., 2021), which has the same architecture as VTA but is more customizable. The high-level architecture of VTA++ is shown in Fig  1.

The General Matrix Multiply (GEMM) core is the cornerstone of VTA’s capabilities, carrying out dense matrix multiplications. This key operation underpins deep learning functions like 2D convolutions using convolutional layers and classification based on the extracted features using fully connected layers. This core is specifically engineered to handle complex arithmetic calculations by multiplying input and weight tensors and then adding the output to a register file tensor.

The GEMM core of VTA is defined by key hardware parameters: BATCH, BLOCK_IN, and BLOCK_OUT. BATCH size refers to the number of data samples processed in parallel, BLOCK_IN represents the inner dimension of the input tensor, and BLOCK_OUT pertains to the output dimension of the weight tensor and the accumulated result.

In this paper, we dedicate one of our actor-critic MARL agents to optimize the VTA GEMM core for enhanced hardware performance. By selectively tuning BATCH, BLOCK_IN, and BLOCK_OUT—referred to herein as ”hardware knobs”—we adapt the GEMM core to be optimized in our method. Altering BATCH size impacts the parallelism of data processing while modifying BLOCK_IN and BLOCK_OUT influences the granularity of the computation and the utilization of the on-chip memory resources. The goal is to strike an optimal balance between computational throughput and resource allocation, thereby maximizing the GEMM core’s efficiency. This RL agent learns to predict the optimal hardware configuration by interacting with the VTA environment and other software agents, thereby learning from the empirical performance outcomes of previous configurations.

2.2. CTDE in MARL

Multi-agent reinforcement learning (MARL) extends traditional reinforcement learning, introducing multiple agents within a shared environment. The complexity of agent interactions calls for sophisticated methodologies that coordinate individual agent strategies with collective objectives. The Centralized Training with Decentralized Execution (CTDE) framework has emerged as a significant paradigm in MARL, balancing the collective intelligence during training with the autonomy of individual agents during operation (Lyu et al., 2021).

During the centralized training phase within the CTDE paradigm, agents have access to a global state and learn policies that consider all agents’ collective state and actions. This centralized approach allows agents to develop cohesive strategies, effectively addressing cooperative, collaborative, competitive, and mixed tasks. The essence of CTDE is encapsulated in the adaptation of Proximal Policy Optimization (PPO) (Schulman et al., 2017) to MARL, leading to the development of Multi-Agent PPO (MAPPO) (Yu et al., 2022).

In the context of MAPPO, the learning process involves three pivotal components: critic learning, General Advantage Estimation (GAE), and policy learning. These components are encapsulated in the following mathematical expressions:

  • Critic learning ensures that each iteration yields an enhanced centralized value function. It is defined as:

    (1) ϕk+1=argminϕ1|Dk|TτDkt=0T(Vϕ(𝐨t,𝐬t,𝐮t)R^t)2subscriptitalic-ϕ𝑘1subscriptitalic-ϕ1subscript𝐷𝑘𝑇subscript𝜏subscript𝐷𝑘superscriptsubscript𝑡0𝑇superscriptsubscript𝑉italic-ϕ𝐨𝑡𝐬𝑡𝐮𝑡^𝑅𝑡2\phi_{k+1}=\arg\min_{\phi}\frac{1}{|D_{k}|T}\sum_{\tau\in D_{k}}\sum_{t=0}^{T}% \left(V_{\phi}(\mathbf{o}{t},\mathbf{s}{t},\mathbf{u}{t})-\hat{R}{t}\right)^{2}italic_ϕ start_POSTSUBSCRIPT italic_k + 1 end_POSTSUBSCRIPT = roman_arg roman_min start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT divide start_ARG 1 end_ARG start_ARG | italic_D start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT | italic_T end_ARG ∑ start_POSTSUBSCRIPT italic_τ ∈ italic_D start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_t = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ( italic_V start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT ( bold_o italic_t , bold_s italic_t , bold_u italic_t ) - over^ start_ARG italic_R end_ARG italic_t ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT
  • General Advantage Estimation (GAE) gauges the quality of the current action relative to the baseline critic value, described by:

    (2) At=t=0(γλ)tδt+1Vsubscript𝐴𝑡superscriptsubscript𝑡0superscript𝛾𝜆𝑡subscriptsuperscript𝛿𝑉𝑡1A_{t}=\sum_{t=0}^{\infty}(\gamma\lambda)^{t}\delta^{V}_{t+1}italic_A start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_t = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∞ end_POSTSUPERSCRIPT ( italic_γ italic_λ ) start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT italic_δ start_POSTSUPERSCRIPT italic_V end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT
  • Policy learning involves computing the policy gradient using the estimated advantage to update the policy function:

    (3) L(𝐨,𝐬,𝐮,𝐮,θk,θ)=𝐿𝐨𝐬𝐮superscript𝐮subscript𝜃𝑘𝜃absent\displaystyle L(\mathbf{o},\mathbf{s},\mathbf{u},\mathbf{u}^{\prime},\theta_{k% },\theta)=italic_L ( bold_o , bold_s , bold_u , bold_u start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_θ ) = min(πθ(𝐮|𝐨)πθk(𝐮|𝐨)Aπθk(𝐨,𝐬,𝐮),\displaystyle\min\bigg{(}\frac{\pi_{\theta}(\mathbf{u}|\mathbf{o})}{\pi_{% \theta_{k}}}(\mathbf{u}|\mathbf{o})A^{\pi_{\theta_{k}}}(\mathbf{o},\mathbf{s},% \mathbf{u}^{\prime}),roman_min ( divide start_ARG italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( bold_u | bold_o ) end_ARG start_ARG italic_π start_POSTSUBSCRIPT italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_ARG ( bold_u | bold_o ) italic_A start_POSTSUPERSCRIPT italic_π start_POSTSUBSCRIPT italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ( bold_o , bold_s , bold_u start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) ,
    clip(πθ(𝐮|𝐨)πθk,1ϵ,1+ϵ)Aπθk(𝐨,𝐬,𝐮))\displaystyle\text{clip}\left(\frac{\pi_{\theta}(\mathbf{u}|\mathbf{o})}{\pi_{% \theta_{k}}},1-\epsilon,1+\epsilon\right)A^{\pi_{\theta_{k}}}(\mathbf{o},% \mathbf{s},\mathbf{u}^{\prime})\bigg{)}clip ( divide start_ARG italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( bold_u | bold_o ) end_ARG start_ARG italic_π start_POSTSUBSCRIPT italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_ARG , 1 - italic_ϵ , 1 + italic_ϵ ) italic_A start_POSTSUPERSCRIPT italic_π start_POSTSUBSCRIPT italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ( bold_o , bold_s , bold_u start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) )

where D𝐷Ditalic_D represents the trajectories that can be shared across agents, R^^𝑅\hat{R}over^ start_ARG italic_R end_ARG is the estimated reward, τ𝜏\tauitalic_τ denotes a trajectory, and A𝐴Aitalic_A is the advantage. The discount factor γ𝛾\gammaitalic_γ and the weight of the GAE λ𝜆\lambdaitalic_λ govern the temporal scope of the advantage estimation. In these expressions, 𝐮𝐮\mathbf{u}bold_u denotes the current agent action, while 𝐮superscript𝐮\mathbf{u}^{\prime}bold_u start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT encapsulates the action set of all agents except the acting agent. The global state is represented by 𝐬𝐬\mathbf{s}bold_s, and 𝐨𝐨\mathbf{o}bold_o signifies the local observation of an agent. The hyperparameter ϵitalic-ϵ\epsilonitalic_ϵ controls the policy update’s deviation from the previous policy. Vϕsubscript𝑉italic-ϕV_{\phi}italic_V start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT is the shared value function, and πθsubscript𝜋𝜃\pi_{\theta}italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT denotes the shared policy function, highlighting the collaborative essence of CTDE in MARL.

Empirical studies reveal that MAPPO can rival and, at times, outperform competitive off-policy algorithms, which have traditionally been preferred for their sample efficiency (Yu et al., 2022). These findings led us to use MAPPO as our optimization algorithm used in MARL.

2.3. Workflow for DNN compilers

Mapping DNN models into efficient machine code is crucial for their performance across various hardware platforms. This mapping is managed by a compiler workflow that optimizes the DNN model in several stages (Allen and Kennedy, 2002). Initially, the compiler’s frontend conducts general optimizations. These broad improvements do not yet account for the specific hardware where the model will run. Subsequently, backend optimizations consider the target hardware’s characteristics, shaping the code in a hardware-aware manner. Recent advancements like the AutoTVM system have introduced an innovative step that fine-tunes the code based on actual hardware performance feedback. This method explores various settings—referred to as ”knobs”—to determine the most effective configuration for the code to run efficiently. The search for this optimal configuration (ΘsuperscriptΘ\Theta^{*}roman_Θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT) is captured by:

Θ=argmaxΘD{f[τ(Θ)]},superscriptΘΘ𝐷argmax𝑓delimited-[]𝜏Θ\Theta^{*}=\underset{\Theta\in D}{\text{argmax}}\left\{f\left[\tau(\Theta)% \right]\right\},roman_Θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT = start_UNDERACCENT roman_Θ ∈ italic_D end_UNDERACCENT start_ARG argmax end_ARG { italic_f [ italic_τ ( roman_Θ ) ] } ,

where the term f𝑓fitalic_f represents the performance or ’fitness’ of the code, and ΘΘ\Thetaroman_Θ stands for a specific set of knob settings. The compiler’s goal is to discover the knob settings that maximize the code’s performance on the target hardware within the allowed configuration space D𝐷Ditalic_D.

2.4. Related Work

In the domain of DNN compilers, adaptive and efficient auto-tuning mechanisms have proven critical for optimizing performance on varying hardware architectures (Rieber et al., 2022; Ryu et al., 2022; Lu et al., 2022; Fayyazi et al., 2024; Dhakal et al., 2022). An interesting development in this field is AutoTVM (Chen et al., 2018), which utilizes a machine learning-based cost model, specifically XGBoost (Chen and Guestrin, 2016), for its auto-tuning process. This method, while effective, involves a large search space and substantial measurement overhead, making optimization resource-intensive.

MetaTune (Ryu and Sung, 2021) was introduced to address the limitations of black-box auto-tuning by leveraging meta-learning to allow its cost model to ”learn to learn” performance correlations. This enables MetaTune (Ryu and Sung, 2021) to adapt to new optimization spaces quickly and efficiently, demonstrating improvements in inference times compared to prior methods. Glimpse (Ahn et al., 2022) builds upon these approaches by integrating mathematical embeddings of hardware specifications, known as Blueprints, in a Bayesian optimization framework. These Blueprints guide the search algorithm towards subspaces with higher performance potential.

Another approach is CHAMELEON(Ahn et al., 2020), which employs reinforcement learning for its Adaptive Exploration and Sampling algorithms, minimizing invalid configurations and costly hardware measurements. CHAMELEON(Ahn et al., 2020) significantly reduces optimization time compared to AutoTVM while improving inference times. NaaS (Neural Architecture and Accelerator Search) (Zhou et al., 2022) expands upon these frameworks by jointly optimizing neural network architectures and hardware accelerators. It parameterizes both architectures in a unified search space using PyGlove and demonstrates efficiency gains across tasks like image classification and segmentation.

PRIME (Data-Driven Offline Optimization for Architecting Hardware Accelerators) (Kumar et al., 2021) introduces a novel data-driven method for designing hardware accelerators using logged simulation data to construct a robust, conservative surrogate model. By eliminating simulation-driven optimization’s repeated overhead, PRIME significantly reduces simulation time while architecting specialized accelerators for single- and multi-application tasks.

Comparing these methods highlights AutoTVM’s groundwork in machine learning-based auto-tuning. MetaTune and Glimpse advance this by incorporating meta-learning and hardware-aware strategies, respectively. CHAMELEON extends the reinforcement learning approach with improved sampling methods. NaaS and PRIME offer new dimensions by integrating accelerator co-design and offline data-driven optimization. However, despite the advances brought by NaaS and PRIME in hardware consideration, their lack of reinforcement learning makes them slower in compilation times, as they can’t rapidly navigate the expansive hardware and software search space. Reinforcement learning algorithms, as used in CHAMELEON, enable compilers to navigate possible configurations efficiently, identifying high-performance options with fewer tests. This results in a more agile optimization process, allowing for quicker deployment of DNN models and facilitating rapid progress in deep learning research and applications(Zhang et al., 2022; Wang et al., 2022; Bakshi and Johnsson, 2023; Ahn et al., 2020).

In comparing our approach to existing methods like CHAMELEON, Glimpse, and MetaTune—which focus on reducing compilation times through different optimization techniques—and co-design methods like NaaS and PRIME—which strive to enhance system throughput through various optimization strategies—our analysis underscores the distinct advantages of MARL in efficiently navigating the breadth and depth of the design space in modern DNN applications. While effective in their own ways, these methods often fail to fully capture the interdependencies between hardware and software optimizations that our framework does, which can result in suboptimal performance and inefficient throughput in real-world applications. Additionally, co-design optimizers like NaaS and PRIME, though aiming to streamline throughput, lack the capacity to effectively reduce optimization time within the vast hardware-software search space and require significant resources for compiling designs.

3. ARCO

As explained before, current systems fall short of effectively optimizing both the hardware and software components of DNNs simultaneously, as they tend to prioritize software parameters exclusively. Also, existing co-design approaches are too time-consuming to compile. To break free from prolonged optimization processes and fragmented tuning methods, it is important to take on two key challenges: 1) Improving the search algorithm to encompass hardware architecture and parameters as modifiable factors and 2) Refining the sampling process to capture the solution distribution better and minimize encountering non-feasible configurations.

We propose two key advancements in the co-optimizing compiler for DNNs to address these obstacles. Firstly, we incorporate a multi-agent reinforcement learning strategy into the search algorithm, allowing it to adapt to new design environments while simultaneously refining hardware and software parameters. This approach harnesses collective intelligence to achieve a more holistic and efficient optimization approach. Secondly, we introduce a novel Confidence Sampling (CS) method to replace the existing uniform (Chen et al., 2018) and adaptive sampling methods (Ahn et al., 2020), which often are random and rely on trial and error. The proposed CS method identifies configurations with a higher probability of success, guided by learned patterns from previous successful optimizations.

3.1. Overview of ARCO

Refer to caption
Figure 2. Overall search flow of ARCO.

Figure  2 depicts the architecture of our sophisticated co-optimizing compiler, termed ARCO, which elucidates the process of optimization compilation. ARCO initiates with a code template, τ𝜏\tauitalic_τ, for each layer of the neural network and a corresponding design space DΘsubscript𝐷ΘD_{\Theta}italic_D start_POSTSUBSCRIPT roman_Θ end_POSTSUBSCRIPT, delineating the range of possible configurations. The compiler then engages in an optimization process, navigating through the configuration space ΘΘ\Thetaroman_Θ to ascertain the optimal code configuration, represented as τ(Θ)𝜏superscriptΘ\tau(\Theta^{*})italic_τ ( roman_Θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ).

ARCO maneuvers through the design space, employing a cost model as a surrogate for direct hardware measurements. The MARL Exploration module generates an initial set of candidate configurations SΘsubscript𝑆ΘS_{\Theta}italic_S start_POSTSUBSCRIPT roman_Θ end_POSTSUBSCRIPT. Through the CS method, ARCO refines SΘsubscript𝑆ΘS_{\Theta}italic_S start_POSTSUBSCRIPT roman_Θ end_POSTSUBSCRIPT to produce a more focused subset SΘsubscriptsuperscript𝑆ΘS^{\prime}_{\Theta}italic_S start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT roman_Θ end_POSTSUBSCRIPT, which contains a condensed yet highly promising set of configurations. The configurations in SΘsubscriptsuperscript𝑆ΘS^{\prime}_{\Theta}italic_S start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT roman_Θ end_POSTSUBSCRIPT are then passed to the MARL Code Generator module, which incorporates them with the input template τ𝜏\tauitalic_τ to create a series of potential executable codes τ(Θ)𝜏Θ\tau(\Theta)italic_τ ( roman_Θ ). These are subsequently deployed on the hardware for empirical runtime evaluations. The hardware runtimes provide a measure of the configurations’ fitness, captured by a fitness function f𝑓fitalic_f, which informs the update of the cost model and enhances the exploration in subsequent iterations. After several iterations, the process converges to some τ(Θ)𝜏superscriptΘ\tau(\Theta^{*})italic_τ ( roman_Θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) that achieves optimal fitness, characterized by the shortest hardware execution runtime. This configuration (Θ)superscriptΘ(\Theta^{*})( roman_Θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) is then selected as the output for that network layer.

3.2. MARL Exploration

Refer to caption
Figure 3. High-level view of MARL Exploration Module. Each Agent has a policy network and, based on the centralized critic feedback, it will do an action in its own environment.

In our work, we implement a MARL Exploration module (As shown in Figure 3) that employs a Multi-Agent Reinforcement Learning (MARL) strategy designed to optimize the configurations of DNN architectures and hardware simultaneously. This module utilizes the principle of Centralized Training and Decentralized Execution (CTDE), allowing agents to learn collaboratively while operating independently during the execution phase to maximize the fitness function, f𝑓fitalic_f, of configurations within the configuration space, SΘsubscript𝑆ΘS_{\Theta}italic_S start_POSTSUBSCRIPT roman_Θ end_POSTSUBSCRIPT. MARL’s ability to manage complex decision-making environments makes it particularly suitable for navigating the high-dimensional configuration space of DNNs.

The MARL Exploration strategy employs three agents, each equipped with a policy network, as detailed in Table 1. Additionally, there is a centralized critic network (value network) that, along with the policy networks, is implemented as a Multi-Layer Perceptron (MLP) neural network. The policy network directs each agent to propose adjustments to the configuration knobs (as shown in Table 2) within its assigned portion of the design space. In contrast, the value network estimates the potential value of these adjustments. A shared central critic aids in evaluating the global state, facilitating collective learning during the training phase, while each agent makes decisions independently during execution, adhering to the CTDE paradigm.

Table 1. Roles and Responsibilities of Different Agents
Agent Type Description
Software Agent 1
(Scheduling Agent)
This agent focuses on how tasks are parallelized and distributed, which is crucial for effective scheduling. Scheduling tasks typically involve deciding how computations are assigned and executed in parallel threads.
Software Agent 2
(Mapping Agent)
This agent deals with how data (namely, tensor height and width) is divided and processed. The focus on specific data dimensions makes this agent suitable for determining how computations map onto the hardware, dealing with spatial data distribution.
Hardware Agent This agent adjusts parameters that directly influence how different components of the tensors (batches, input channels, and output channels) are divided and processed by the hardware. This suggests a deep focus on optimizing the hardware utilization based on the hardware’s capabilities and limitations (namely, resource counts and performance levels.)

The optimization unfolds through several episodes, each consisting of multiple search steps. During these steps, the agents evaluate the current configuration and independently decide on actions to improve subsequent configuration fitness, guided by shared insights and individual observations. The process detailed in Algorithm 1 outlines how each agent interacts within the CTDE framework, iterating through configurations to enhance the overall network performance.

Algorithm 1 CTDE MARL Exploration for DNN Configuration Optimization
1:Initialize centralized critic and individual policy networks for each agent
2:for each episode do
3:     Initialize a set of configurations, SΘsubscript𝑆ΘS_{\Theta}italic_S start_POSTSUBSCRIPT roman_Θ end_POSTSUBSCRIPT
4:     for each search step in the episode do
5:         for each agent do
6:               Observe the current configuration, ΘΘ\Thetaroman_Θ, and receive shared insights from the centralized critic
7:               Independently chooses an action based on the policy network and local observations
8:              Apply the action to update the configuration
9:              Estimate the new configuration’s local value
10:         end for
11:          Collectively evaluate all new configurations using the cost model
12:          Update the centralized critic based on global performance feedback
13:          Individually update each agent’s policy network based on local and shared feedback
14:     end for
15:      Determine the configuration with the highest fitness from collective insights
16:end for
17:Return the optimal configuration, τ(Θ)𝜏superscriptΘ\tau(\Theta^{*})italic_τ ( roman_Θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT )

Each episode integrates the evaluations from a cost model that serves as a surrogate for direct hardware performance measurements. This model’s feedback is used to update the centralized critic, enhance the shared knowledge base, and refine the individual policy networks, refining each agent’s decision-making capabilities based on global and local feedback. As the episodes progress, the MARL Exploration Module becomes increasingly adept at identifying configurations that yield the best performance, ultimately converging on τ(Θ)𝜏superscriptΘ\tau(\Theta^{*})italic_τ ( roman_Θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) with the optimal fitness, f𝑓fitalic_f.

Applying this CTDE framework within the MARL context, coupled with the PPO optimization algorithm, represents a significant advancement in the auto-tuning of DNN and hardware configurations. This approach systematically improves the configuration search process, significantly enhancing DNN performance while reducing the computational overhead.

3.2.1. Cost Model and Central Critic Update

In the ARCO framework, the optimization goal is to improve throughput, with the cost model reflecting the inverse of execution time. The central critic, a key component of our MARL setup, updates its policy based on the aggregated feedback from all agents, continuously optimizing the global state evaluation. The critic’s learning process is guided by the update rule, which is mathematically expressed in section 2.2, which minimizes the mean squared error between the predicted values and the actual rewards obtained.

3.2.2. Incorporating Hardware and Software Constraints

To integrate constraints related to hardware, such as area limitations, or software, such as memory usage, into the MARL framework, a penalty term can be incorporated into the reward function. This approach adjusts the reward based on the degree to which the constraints are violated, effectively guiding the agents to prefer configurations that adhere to these constraints. For instance, a penalty function P𝑃Pitalic_P can be defined as follows:

P(Θ)=𝑃Θabsent\displaystyle P(\Theta)=italic_P ( roman_Θ ) = λ(max(0,area(Θ)areamax)\displaystyle\lambda\left(\max(0,\text{area}(\Theta)-\text{area}_{\text{max}})\right.italic_λ ( roman_max ( 0 , area ( roman_Θ ) - area start_POSTSUBSCRIPT max end_POSTSUBSCRIPT )
(4) +max(0,memory(Θ)memorymax))\displaystyle\left.+\max(0,\text{memory}(\Theta)-\text{memory}_{\text{max}})\right)+ roman_max ( 0 , memory ( roman_Θ ) - memory start_POSTSUBSCRIPT max end_POSTSUBSCRIPT ) )

where λ𝜆\lambdaitalic_λ is a scaling factor that adjusts the impact of the penalty, area(Θ)areaΘ\text{area}(\Theta)area ( roman_Θ ) and memory(Θ)memoryΘ\text{memory}(\Theta)memory ( roman_Θ ) are the area and memory usage of the configuration ΘΘ\Thetaroman_Θ, and areamaxsubscriptareamax\text{area}_{\text{max}}area start_POSTSUBSCRIPT max end_POSTSUBSCRIPT and memorymaxsubscriptmemorymax\text{memory}_{\text{max}}memory start_POSTSUBSCRIPT max end_POSTSUBSCRIPT are the respective maximums.

The reward function, modified to include this penalty, becomes:

(5) Rt=1execution time(Θ)P(Θ)subscript𝑅𝑡1execution timeΘ𝑃ΘR_{t}=\frac{1}{\text{execution time}(\Theta)}-P(\Theta)italic_R start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG execution time ( roman_Θ ) end_ARG - italic_P ( roman_Θ )

This modification ensures that the MARL agents not only seek to optimize performance but also adhere to specified design constraints, balancing efficiency with practical deployment considerations.

Table 2. Knobs in the design space to optimize convolution layers (These knobs make a search space as big as O(212superscript2122^{12}2 start_POSTSUPERSCRIPT 12 end_POSTSUPERSCRIPT))
Agent Type Knobs
Hardware Design Agent (Hardware Optimizer) tile_b, tile_ci, tile_co
- Tile across batch size (tile_b)
- Tile across input channels (tile_ci)
- Tile across output channels (tile_co)
Scheduling Agent (Software Optimizer) h_threading, oc_threading
- Horizontal threading (h_threading)
- Output channel threading (oc_threading)
Mapping Agent (Software Optimizer) tile_h, tile_w
- Tile across height (tile_h)
- Tile across width (tile_w)

3.3. Confidence Sampling

In the domain of DNN optimization, the ability to efficiently sample configuration spaces without costly hardware measurements is essential for rapid compilation. We introduce the Confidence Sampling (CS) method. This technique is designed to leverage the probability distribution of configurations’ fitness judiciously to select a subset likely to yield high-performing network configurations.

Although the Confidence Sampling method operates on the principles of probability-guided selection and shares some conceptual similarities with Importance Sampling (IS) (Tokdar and Kass, 2010) in statistical methods, it is distinct in its purpose and application. In IS, samples are re-weighted based on their probability density to approximate expected values more efficiently. Confidence Sampling, however, specifically focuses on reducing the number of hardware measurements required when exploring the design space. By evaluating the configurations’ values, this method synthesizes a subset that not only encapsulates the diversity of the configuration space but also emphasizes the high-confidence regions, i.e., areas where the likelihood of encountering an optimal or near-optimal configuration is high.

The core of the CS method lies in its algorithmic approach to filtering and enhancing the set of configurations for subsequent optimization iterations. To formalize this process, we introduce Algorithm 2, which captures the essence of the CS method within the optimization pipeline. The process involves the following steps:

  1. (1)

    Evaluate Configurations: Each configuration’s value is estimated using the output of the value network (critic network) to construct a probability distribution over the configuration space. (Line 2)

  2. (2)

    Probability-Guided Selection: Configurations are sampled based on this distribution, prioritizing those with higher estimated values. (Line 3-4)

  3. (3)

    Confidence Assessment: A dynamic thresholding mechanism is applied to determine the confidence level of each selected configuration. (Line 5 )

  4. (4)

    Synthesis of Configurations: Lower-confidence configurations are replaced with synthesized ones by combining each parameter’s most frequently occurring settings across the sampled configurations. (Line 6-7)

Algorithm 2 Confidence Sampling for DNN Optimization
1:procedure ConfidenceSampling(𝒮Θsubscript𝒮Θ\mathcal{S}_{\Theta}caligraphic_S start_POSTSUBSCRIPT roman_Θ end_POSTSUBSCRIPT, value_network𝑣𝑎𝑙𝑢𝑒_𝑛𝑒𝑡𝑤𝑜𝑟𝑘value\_networkitalic_v italic_a italic_l italic_u italic_e _ italic_n italic_e italic_t italic_w italic_o italic_r italic_k, Nselectsubscript𝑁𝑠𝑒𝑙𝑒𝑐𝑡N_{select}italic_N start_POSTSUBSCRIPT italic_s italic_e italic_l italic_e italic_c italic_t end_POSTSUBSCRIPT)
2:      𝒱predsvalue_network.predict(𝒮Θ)formulae-sequencesubscript𝒱𝑝𝑟𝑒𝑑𝑠𝑣𝑎𝑙𝑢𝑒_𝑛𝑒𝑡𝑤𝑜𝑟𝑘𝑝𝑟𝑒𝑑𝑖𝑐𝑡subscript𝒮Θ\mathcal{V}_{preds}\leftarrow value\_network.predict(\mathcal{S}_{\Theta})caligraphic_V start_POSTSUBSCRIPT italic_p italic_r italic_e italic_d italic_s end_POSTSUBSCRIPT ← italic_v italic_a italic_l italic_u italic_e _ italic_n italic_e italic_t italic_w italic_o italic_r italic_k . italic_p italic_r italic_e italic_d italic_i italic_c italic_t ( caligraphic_S start_POSTSUBSCRIPT roman_Θ end_POSTSUBSCRIPT ) \triangleright Estimate values for all configurations
3:      𝒫softmax(𝒱preds)𝒫𝑠𝑜𝑓𝑡𝑚𝑎𝑥subscript𝒱𝑝𝑟𝑒𝑑𝑠\mathcal{P}\leftarrow softmax(\mathcal{V}_{preds})caligraphic_P ← italic_s italic_o italic_f italic_t italic_m italic_a italic_x ( caligraphic_V start_POSTSUBSCRIPT italic_p italic_r italic_e italic_d italic_s end_POSTSUBSCRIPT ) \triangleright Convert values to a probability distribution
4:     𝒞selectedSelectConfigurations(𝒫,Nconfigs)subscript𝒞𝑠𝑒𝑙𝑒𝑐𝑡𝑒𝑑SelectConfigurations𝒫subscript𝑁𝑐𝑜𝑛𝑓𝑖𝑔𝑠\mathcal{C}_{selected}\leftarrow\textsc{SelectConfigurations}(\mathcal{P},N_{% configs})caligraphic_C start_POSTSUBSCRIPT italic_s italic_e italic_l italic_e italic_c italic_t italic_e italic_d end_POSTSUBSCRIPT ← SelectConfigurations ( caligraphic_P , italic_N start_POSTSUBSCRIPT italic_c italic_o italic_n italic_f italic_i italic_g italic_s end_POSTSUBSCRIPT )
5:     thresholdComputeDynamicThreshold(𝒱preds)𝑡𝑟𝑒𝑠𝑜𝑙𝑑ComputeDynamicThresholdsubscript𝒱𝑝𝑟𝑒𝑑𝑠threshold\leftarrow\textsc{ComputeDynamicThreshold}(\mathcal{V}_{preds})italic_t italic_h italic_r italic_e italic_s italic_h italic_o italic_l italic_d ← ComputeDynamicThreshold ( caligraphic_V start_POSTSUBSCRIPT italic_p italic_r italic_e italic_d italic_s end_POSTSUBSCRIPT ) \triangleright Compute an adaptive threshold
6:     𝒞high_conf{c𝒞selected𝒱preds[c]>threshold}subscript𝒞𝑖𝑔_𝑐𝑜𝑛𝑓conditional-set𝑐subscript𝒞𝑠𝑒𝑙𝑒𝑐𝑡𝑒𝑑subscript𝒱𝑝𝑟𝑒𝑑𝑠delimited-[]𝑐𝑡𝑟𝑒𝑠𝑜𝑙𝑑\mathcal{C}_{high\_conf}\leftarrow\{c\in\mathcal{C}_{selected}\mid\mathcal{V}_% {preds}[c]>threshold\}caligraphic_C start_POSTSUBSCRIPT italic_h italic_i italic_g italic_h _ italic_c italic_o italic_n italic_f end_POSTSUBSCRIPT ← { italic_c ∈ caligraphic_C start_POSTSUBSCRIPT italic_s italic_e italic_l italic_e italic_c italic_t italic_e italic_d end_POSTSUBSCRIPT ∣ caligraphic_V start_POSTSUBSCRIPT italic_p italic_r italic_e italic_d italic_s end_POSTSUBSCRIPT [ italic_c ] > italic_t italic_h italic_r italic_e italic_s italic_h italic_o italic_l italic_d }
7:     return 𝒞high_confsubscript𝒞𝑖𝑔_𝑐𝑜𝑛𝑓\mathcal{C}_{high\_conf}caligraphic_C start_POSTSUBSCRIPT italic_h italic_i italic_g italic_h _ italic_c italic_o italic_n italic_f end_POSTSUBSCRIPT
8:end procedure
9:function SelectConfigurations(𝒫𝒫\mathcal{P}caligraphic_P, Nconfigssubscript𝑁𝑐𝑜𝑛𝑓𝑖𝑔𝑠N_{configs}italic_N start_POSTSUBSCRIPT italic_c italic_o italic_n italic_f italic_i italic_g italic_s end_POSTSUBSCRIPT)
10:      indices𝑖𝑛𝑑𝑖𝑐𝑒𝑠absentindices\leftarrowitalic_i italic_n italic_d italic_i italic_c italic_e italic_s ← sample indices from [1,,|𝒫|]1𝒫[1,\ldots,|\mathcal{P}|][ 1 , … , | caligraphic_P | ] with probability 𝒫𝒫\mathcal{P}caligraphic_P
11:     return 𝒮Θ[indices[1:Nconfigs]]\mathcal{S}_{\Theta}[indices[1:N_{configs}]]caligraphic_S start_POSTSUBSCRIPT roman_Θ end_POSTSUBSCRIPT [ italic_i italic_n italic_d italic_i italic_c italic_e italic_s [ 1 : italic_N start_POSTSUBSCRIPT italic_c italic_o italic_n italic_f italic_i italic_g italic_s end_POSTSUBSCRIPT ] ]
12:end function
13:function ComputeDynamicThreshold(𝒱predssubscript𝒱𝑝𝑟𝑒𝑑𝑠\mathcal{V}_{preds}caligraphic_V start_POSTSUBSCRIPT italic_p italic_r italic_e italic_d italic_s end_POSTSUBSCRIPT)
14:     return median(𝒱preds)𝑚𝑒𝑑𝑖𝑎𝑛subscript𝒱𝑝𝑟𝑒𝑑𝑠median(\mathcal{V}_{preds})italic_m italic_e italic_d italic_i italic_a italic_n ( caligraphic_V start_POSTSUBSCRIPT italic_p italic_r italic_e italic_d italic_s end_POSTSUBSCRIPT )
15:end function

As part of the larger system, the CS method interacts closely with the exploration and exploitation mechanisms to ensure that the exploration is directed toward the most promising regions of the design space. By applying the CS method, we have observed a reduced number of required configurations, as shown in Figure  4. Moreover, we can observe the sampling process gravitates towards configurations that demonstrate superior performance over time, highlighting the efficacy of this method.

Refer to caption
(a)
Refer to caption
(b)
Figure 4. Configurations over time for ResNet-18 model a) before and b) after applying the CS method.

4. Experimental Results

4.1. Experimental Setup

We have incorporated ARCO within the TVM framework to thoroughly evaluate its components, contrasting its performance with that of AutoTVM and CHAMELEON. This integration allows for a comprehensive, end-to-end assessment of ARCO across a suite of advanced deep-learning models, including AlexNet, VGG-11, VGG-13, VGG-16, VGG-19, ResNet-18, and ResNet-34. All models were benchmarked using the same AutoTVM compilation duration to ensure a rigorous and equitable comparison. Further, we have employed the source code of the CHAMELEON (Ahn et al., 2020) to compare its efficacy with AutoTVM and ARCO. As mentioned above, AutoTVM and CHAMELEON do not support hardware configuration exploration; thus, in the case of these two exploration techniques, we have considered the default specification values of the VTA++ (Banerjee et al., 2021). Performance tests were conducted on a robust platform powered by a 3.4 GHz AMD EPYC 7763 64-Core Processor, providing a stable and powerful environment for these evaluations. As discussed before, we used the VTA++ simulator as our target hardware.

Table 3 shows the details of the DNN models employed for the comparison studies. In our setup, each agent is equipped with its own policy network, and a centralized value network functions as the critic for all agents. Both networks are implemented as Multi-Layer Perceptrons (MLPs) using TensorFlow (Abadi et al., 2016) that their details are:

  • Policy Network: Each agent’s policy network consists of a single hidden layer with 20 neurons using ReLU activation functions. The output layer uses softmax activation to generate a probability distribution over actions, allowing for decision-making informed by learned policies.

  • Value Network: The centralized critic, or value network, employs three hidden layers with 20 neurons each, using tanh activation functions. This design helps stabilize learning by providing a continuous estimate of state values, which guides collective policy adjustments.

Table 3. Details of the DNN models used in evaluating ARCO (All of the models are extracted from Mxnet (Chen et al., 2015)).
Network Dataset Number of Convolution Tasks
AlexNet(Krizhevsky et al., 2012) ImageNet 5
VGG-11(Simonyan and Zisserman, 2014) ImageNet 8
VGG-13(Simonyan and Zisserman, 2014) ImageNet 10
VGG-16(Simonyan and Zisserman, 2014) ImageNet 13
VGG-19(Simonyan and Zisserman, 2014) ImageNet 16
ResNet-18(He et al., 2016) ImageNet 17
ResNet-34(He et al., 2016) ImageNet 33

Hyper-parameter tuning is crucial in optimizing the performance of machine learning tools and models (Yu and Zhu, 2020). To facilitate transparency and reproducibility, we have detailed the hyper-parameters utilized in our evaluation in Table 4. These hyper-parameters were meticulously tuned offline to optimize model performance, and they are the same as those used in CHAMELEON (Ahn et al., 2020). For the hyper-parameters listed in Table 5, we adhered to the values specified in the AutoTVM study to maintain consistency and ensure a fair comparison. Similarly, we adopted the hyper-parameter settings from the MAPPO paper (Yu et al., 2022) for the MARL exploration module, aligning our methodologies with established research to validate our findings effectively.

Table 4. Hyper-parameters (HPs) used in ARCO.
HP Value Description
iteration_opt 16
Total iterations for the optimization cycle
(equivalent to 1000 hardware measurements)
modeGBT xgb-reg Loss function type utilized in the cost model
bGBT 64 Planning’s maximum batch size in GBT (Chen et al., 2018)
episode_rl 128 Number of episodes for the RL process
step_rl 500 Maximum # of steps in a single RL episode
Table 5. Hyper-parameters (HPs) used in AutoTVM
HP Value Description
Σ(bGBT)Σsubscript𝑏𝐺𝐵𝑇\Sigma(b_{GBT})roman_Σ ( italic_b start_POSTSUBSCRIPT italic_G italic_B italic_T end_POSTSUBSCRIPT ) 1000 Total count of hardware measurements
modeGBT𝑚𝑜𝑑subscript𝑒𝐺𝐵𝑇mode_{GBT}italic_m italic_o italic_d italic_e start_POSTSUBSCRIPT italic_G italic_B italic_T end_POSTSUBSCRIPT xgb-reg Loss function type for the cost model
bGBTsubscript𝑏𝐺𝐵𝑇b_{GBT}italic_b start_POSTSUBSCRIPT italic_G italic_B italic_T end_POSTSUBSCRIPT 64 Batch size for planning in GBT (Chen et al., 2018)
nsasubscript𝑛𝑠𝑎n_{sa}italic_n start_POSTSUBSCRIPT italic_s italic_a end_POSTSUBSCRIPT 128 Count of Markov chains in parallel SA
stepsa𝑠𝑡𝑒subscript𝑝𝑠𝑎step_{sa}italic_s italic_t italic_e italic_p start_POSTSUBSCRIPT italic_s italic_a end_POSTSUBSCRIPT 500 Maximum # of steps in a single SA run
  • SA: Simulated Annealing.

4.2. End-to-end Evaluation

Figure 5 Compares the achieved throughput of different frameworks over AutoTVM. As the results illustrate, the significant reduction in mean inference time (advancement in throughput) achieved by ARCO compared to AutoTVM and CHAMELEON for various DNN models: AlexNet, VGG-11, VGG-13, VGG-16, VGG-19, and ResNet-18, and ResNet-34. On average, our methodology demonstrates a 1.17× improvement in throughput. Table 6 reports detailed numerical results supporting these findings.

Refer to caption
Figure 5. Comparing the achieved throughput of different frameworks over AutoTVM on VTA++.
Table 6. Mean inference times for different frameworks on VTA++ (in seconds)
Model AutoTVM CHAMELEON ARCO
ResNet-18 1.730 611.730611.730\,611.730 61 1.595 951.595951.595\,951.595 95 1.25448
ResNet-34 3.634 093.634093.634\,093.634 09 3.537 803.537803.537\,803.537 80 2.66561
AlexNet 0.541 160.541160.541\,160.541 16 0.589 810.589810.589\,810.589 81 0.42160
VGG11 4.952 294.952294.952\,294.952 29 4.212 834.212834.212\,834.212 83 4.20685
VGG13 6.192 896.192896.192\,896.192 89 6.163 506.163506.163\,506.163 50 6.03418
VGG16 8.558 788.558788.558\,788.558 78 8.730 898.730898.730\,898.730 89 8.34791
VGG19 11.291 5011.2915011.291\,5011.291 50 11.288 3911.2883911.288\,3911.288 39 11.27800
Refer to caption
Figure 6. Comparing the compilation time of different frameworks (The percentages show the speedup of ARCO compared to AutoTVM).

While achieving improvement in throughput, ARCO also needs less compilation time compared to AutoTVM; Figure 6 shows that ARCO speeds up the optimization time up to 42.2%. Figure 7 shows ARCO’s comparative output code performance relative to other frameworks. Notably, ARCO achieves convergence to the peak GFLOPS value, equivalent to those reached by AutoTVM and CHAMELEON (without adaptive sampling), but with greater efficiency. This is accomplished with fewer hardware measurements and at a faster rate, underscoring the efficacy of the CS method.

Refer to caption
Figure 7. Comparing the compiled code performance of different frameworks for ResNet-18 model.

The overall efficiency gains are primarily attributable to the sophisticated MARL exploration techniques and the reduced hardware measurements necessitated by implementing the CS method.

5. Conclusion

This paper introduced ARCO, an innovative multi-agent reinforcement learning-based compiler that significantly advances the field of DNN accelerator design. By integrating a multi-agent system with specialized roles—two agents focused on software optimization and one on hardware—ARCO addressed the intricacies of DNN architecture and effectively navigated the vast configuration space to enhance performance and efficiency. Employing the MARL Exploration module and Confidence Sampling method, ARCO reduced the computational overhead and expedited the optimization process while increasing the throughput. Our experimental results, conducted on advanced DNN models like AlexNet, VGG, and ResNet series, demonstrated that ARCO outperforms traditional frameworks like AutoTVM and CHAMELEON, improving throughput by up to 37.95% and reducing optimization time significantly (Up to 42.2%). These enhancements underscore the potential of adaptive, intelligent systems in optimizing hardware and software layers cohesively, setting a new benchmark for future developments in real-time DNN applications.

References

  • (1)
  • Abadi et al. (2016) Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. {{\{{TensorFlow}}\}}: a system for {{\{{Large-Scale}}\}} machine learning. In 12th USENIX symposium on operating systems design and implementation (OSDI 16). 265–283.
  • Ahn et al. (2022) Byung Hoon Ahn, Sean Kinzer, and Hadi Esmaeilzadeh. 2022. Glimpse: mathematical embedding of hardware specification for neural compilation. In Proceedings of the 59th ACM/IEEE Design Automation Conference. 1165–1170.
  • Ahn et al. (2020) Byung Hoon Ahn, Prannoy Pilligundla, Amir Yazdanbakhsh, and Hadi Esmaeilzadeh. 2020. Chameleon: Adaptive code optimization for expedited deep neural network compilation. arXiv preprint arXiv:2001.08743 (2020).
  • Allen and Kennedy (2002) Randy Allen and Ken Kennedy. 2002. Optimizing compilers for modern architectures: a dependence-baced approach. (No Title) (2002).
  • Bakshi and Johnsson (2023) Suyash Bakshi and Lennart Johnsson. 2023. Computationally Efficient DNN Mapping Search Heuristic using Deep Reinforcement Learning. ACM Transactions on Embedded Computing Systems 22, 5s (2023), 1–21.
  • Banerjee et al. (2021) Suvadeep Banerjee, Steve Burns, Pasquale Cocchini, Abhijit Davare, Shweta Jain, Desmond Kirkpatrick, Anton Sorokin, Jin Yang, and Zhenkun Yang. 2021. A highly configurable hardware/Software stack for DNN inference acceleration. arXiv preprint arXiv:2111.15024 (2021).
  • Chen and Guestrin (2016) Tianqi Chen and Carlos Guestrin. 2016. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining. 785–794.
  • Chen et al. (2015) Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. 2015. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274 (2015).
  • Chen et al. (2018) Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, et al. 2018. {{\{{TVM}}\}}: An automated {{\{{End-to-End}}\}} optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). 578–594.
  • Dhakal et al. (2022) Aditya Dhakal, KK Ramakrishnan, Sameer G Kulkarni, Puneet Sharma, and Junguk Cho. 2022. Slice-tune: A system for high performance dnn autotuning. In Proceedings of the 23rd ACM/IFIP International Middleware Conference. 228–240.
  • Du and Ding (2021) Wei Du and Shifei Ding. 2021. A survey on multi-agent deep reinforcement learning: from the perspective of challenges and applications. Artificial Intelligence Review 54, 5 (2021), 3215–3238.
  • Fayyazi et al. (2024) Arash Fayyazi, Mahdi Nazemi, Arya Fayyazi, and Massoud Pedram. 2024. NeuroBlend: Towards Low-Power yet Accurate Neural Network-Based Inference Engine Blending Binary and Fixed-Point Convolutions. arXiv:2307.03784 [cs.AR]
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
  • Hossain et al. (2023) Md Bipul Hossain, Na Gong, and Mohamed Shaban. 2023. Computational Complexity Reduction Techniques for Deep Neural Networks: A Survey. In 2023 IEEE International Conference on Artificial Intelligence, Blockchain, and Internet of Things (AIBThings). IEEE, 1–6.
  • Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems 25 (2012).
  • Kumar et al. (2021) Aviral Kumar, Amir Yazdanbakhsh, Milad Hashemi, Kevin Swersky, and Sergey Levine. 2021. Data-driven offline optimization for architecting hardware accelerators. arXiv preprint arXiv:2110.11346 (2021).
  • Lu et al. (2022) Bingqian Lu, Zheyu Yan, Yiyu Shi, and Shaolei Ren. 2022. A Semi-Decoupled Approach to Fast and Optimal Hardware-Software Co-Design of Neural Accelerators. arXiv preprint arXiv:2203.13921 (2022).
  • Lyu et al. (2021) Xueguang Lyu, Yuchen Xiao, Brett Daley, and Christopher Amato. 2021. Contrasting centralized and decentralized critics in multi-agent reinforcement learning. arXiv preprint arXiv:2102.04402 (2021).
  • Mirjalili and Mirjalili (2019) Seyedali Mirjalili and Seyedali Mirjalili. 2019. Genetic algorithm. Evolutionary algorithms and neural networks: Theory and applications (2019), 43–55.
  • Moreau et al. (2019) Thierry Moreau, Tianqi Chen, Luis Vega, Jared Roesch, Eddie Yan, Lianmin Zheng, Josh Fromm, Ziheng Jiang, Luis Ceze, Carlos Guestrin, et al. 2019. A hardware–software blueprint for flexible deep learning specialization. IEEE Micro 39, 5 (2019), 8–16.
  • Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 (2019).
  • Rieber et al. (2022) Dennis Rieber, Moritz Reiber, Oliver Bringmann, and Holger Fröning. 2022. Hw-aware initialization of dnn auto-tuning to improve exploration time and robustness. arXiv preprint arXiv:2205.15568 (2022).
  • Ryu et al. (2022) Jaehun Ryu, Eunhyeok Park, and Hyojin Sung. 2022. One-shot tuner for deep learning compilers. In Proceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction. 89–103.
  • Ryu and Sung (2021) Jaehun Ryu and Hyojin Sung. 2021. Metatune: Meta-learning based cost model for fast and efficient auto-tuning frameworks. arXiv preprint arXiv:2102.04199 (2021).
  • Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017).
  • Simonyan and Zisserman (2014) Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
  • Tokdar and Kass (2010) Surya T Tokdar and Robert E Kass. 2010. Importance sampling: a review. Wiley Interdisciplinary Reviews: Computational Statistics 2, 1 (2010), 54–60.
  • Vasilache et al. (2018) Nicolas Vasilache, Oleksandr Zinenko, Theodoros Theodoridis, Priya Goyal, Zachary DeVito, William S Moses, Sven Verdoolaege, Andrew Adams, and Albert Cohen. 2018. Tensor comprehensions: Framework-agnostic high-performance machine learning abstractions. arXiv preprint arXiv:1802.04730 (2018).
  • Wang et al. (2022) Huanting Wang, Zhanyong Tang, Cheng Zhang, Jiaqi Zhao, Chris Cummins, Hugh Leather, and Zheng Wang. 2022. Automating reinforcement learning architecture design for code optimization. In Proceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction. 129–143.
  • Yu et al. (2022) Chao Yu, Akash Velu, Eugene Vinitsky, Jiaxuan Gao, Yu Wang, Alexandre Bayen, and Yi Wu. 2022. The surprising effectiveness of ppo in cooperative multi-agent games. Advances in Neural Information Processing Systems 35 (2022), 24611–24624.
  • Yu and Zhu (2020) Tong Yu and Hong Zhu. 2020. Hyper-parameter optimization: A review of algorithms and applications. arXiv preprint arXiv:2003.05689 (2020).
  • Zhang et al. (2022) Zining Zhang, Bingsheng He, and Zhenjie Zhang. 2022. HARL: Hierarchical Adaptive Reinforcement Learning Based Auto Scheduler for Neural Networks. In Proceedings of the 51st International Conference on Parallel Processing. 1–13.
  • Zhou et al. (2022) Yanqi Zhou, Xuanyi Dong, Tianjian Meng, Mingxing Tan, Berkin Akin, Daiyi Peng, Amir Yazdanbakhsh, Da Huang, Ravi Narayanaswami, and James Laudon. 2022. Towards the co-design of neural networks and accelerators. Proceedings of Machine Learning and Systems 4 (2022), 141–152.