ARCO:Adaptive Multi-Agent Reinforcement Learning-Based Hardware/Software Co-Optimization Compiler for Improved Performance in DNN Accelerator Design

Arya Fayyazi University of Southern CaliforniaLos AngelesCAUSA afayyazi@usc.edu , Mehdi Kamal University of Southern CaliforniaLos AngelesCAUSA mehdi.kamal@usc.edu and Massoud Pedram University of Southern CaliforniaLos AngelesCAUSA pedram@usc.edu

Abstract.

This paper presents ARCO, an adaptive Multi-Agent Reinforcement Learning (MARL)-based co-optimizing compilation framework designed to enhance the efficiency of mapping machine learning (ML) models - such as Deep Neural Networks (DNNs) - onto diverse hardware platforms. The framework incorporates three specialized actor-critic agents within MARL, each dedicated to a distinct aspect of compilation/optimization at an abstract level: one agent focuses on hardware, while two agents focus on software optimizations. This integration results in a collaborative hardware/software co-optimization strategy that improves the precision and speed of DNN deployments. Concentrating on high-confidence configurations simplifies the search space and delivers superior performance compared to current optimization methods. The ARCO framework surpasses existing leading frameworks, achieving a throughput increase of up to 37.95% while reducing the optimization time by up to 42.2% across various DNNs.

^†^†copyright: none

1. Introduction

The computational demands and complexity of deep neural networks (DNNs) have surged rapidly with their expanding applications across diverse industry sectors, including autonomous driving, medical imaging, and natural language processing (Hossain et al., 2023). These applications require high accuracy and real-time responsiveness, placing significant pressure on the computational infrastructure. Historically, performance optimization in environments like TensorFlow (Abadi et al., 2016) and PyTorch (Paszke et al., 2019) has relied on hand-optimized kernels such as NVIDIA’s cuDNN and Intel’s MKL. While these kernels are designed to optimize efficiency for particular hardware setups, they face challenges in adapting to the changing demands of modern neural architectures and their evolving computational requirements. The extensive number of parameters that need tuning and a huge hardware configuration search space pose significant hurdles (Ahn et al., 2020).

To address this growing complexity, there has been a shift towards automated compilation frameworks like TVM (Chen et al., 2018), TensorComprehensions (Vasilache et al., 2018), and AutoTVM (Chen et al., 2018). These frameworks move away from static, hand-tuned optimization strategies to dynamic, algorithm-driven approaches capable of adapting to various architectures and operational requirements. For example, AutoTVM utilizes a sophisticated approach involving boosted trees (Chen and Guestrin, 2016) to navigate the extensive configuration space of neural network code efficiently. This method predicts the performance of potential configurations and minimizes the need for exhaustive empirical testing, thereby expediting the optimization process. However, despite these technological advances, optimizing complex models requires substantial resources and compilation time, underscoring the ongoing challenges and gaps in current optimization methodologies. In addition to evolutionary strategies like genetic algorithms (Mirjalili and Mirjalili, 2019), machine-learning techniques have been instrumental in refining the search for optimal configurations across this complex landscape. When considering various machine learning algorithms, reinforcement learning (RL) stands out as a promising option for exploring design search spaces and optimizing tasks (Zhang et al., 2022; Wang et al., 2022; Bakshi and Johnsson, 2023; Ahn et al., 2020). RL operates by testing various strategies and utilizing feedback on their actions to maximize cumulative rewards. This learning process occurs through agents interacting with their environment, which can range from single to multi-agent setups. Multi-agent RL offers the advantage of enabling transfer learning and providing valuable insights into interconnected decision-making systems (Du and Ding, 2021).

To address these inefficiencies and push the boundaries of current compilation technologies, we propose an adaptive multi-agent reinforcement learning-based hardware/software co-optimization compiler called ARCO. At its core, this optimizing compiler relies on a multi-agent reinforcement learning (MARL) algorithm to optimize the mapping of the given model and the accelerator structure concurrently. ARCO utilizes three specialized agents—each designed to optimize different facets of the system architecture: two focus on software and DNN configurations, while the third optimizes the accelerator architecture. This actor-critic multi-agent approach facilitates a more integrated and holistic optimization process and allows for more granular control and customization of the optimization strategies employed.

The proposed framework efficiently narrows the vast search space by employing a soft threshold for selecting high-confidence configurations, focusing only on the most promising candidates. This methodology not only reduces computational overhead but also minimizes exploration time. Our method leverages the capabilities of VTA++ (Banerjee et al., 2021), a highly configurable deep learning accelerator, to investigate diverse architectural optimizations. ARCO utilizes Centralized Training with a Decentralized Execution (CTDE) strategy, wherein training occurs in a centralized manner. In contrast, execution is decentralized, enabling adaptive adjustments to the changing requirements of DNN workloads. This strategic approach efficiently narrows the extensive search space by utilizing a soft threshold to select high-confidence configurations, prioritizing the most promising candidates. The efficacy of the ARCO is assessed under a set of DNN models with a variety of architectures and compared with leading state-of-the-art approaches, and it has demonstrated substantial improvements in throughput.

The rest of the paper is organized as follows. Section 2 provides detailed background on the VTA++, CTDE in MARL, workflow for DNN compilers, and related works that have influenced our approach. ARCO and its detailed architecture and components are discussed in Section 3. In Section 4, the efficacy of the ARCO is assessed and compared to that of the established frameworks such as AutoTVM and CHAMELEON (Ahn et al., 2020). Finally, the paper concludes in Section 5.

2. Background

Refer to caption — Figure 1. High-Level architecture of VTA++.

This section briefly introduces the target hardware we used (i.e., VTA++), the Centralized Training and Decentralized Execution in Multi-Agent Reinforcement Learning, the workflow for DNN compilers, and related prior work.

2.1. VTA++

The Versatile Tensor Accelerator (VTA) (Moreau et al., 2019) has emerged as a highly effective deep learning hardware acceleration framework integrated within the TVM stack. To achieve a good hardware/software co-design, we need highly configurable target hardware, which we could optimize its architecture to achieve simultaneous hardware and software optimization. Therefore, we chose VTA++ (Banerjee et al., 2021), which has the same architecture as VTA but is more customizable. The high-level architecture of VTA++ is shown in Fig 1.

The General Matrix Multiply (GEMM) core is the cornerstone of VTA’s capabilities, carrying out dense matrix multiplications. This key operation underpins deep learning functions like 2D convolutions using convolutional layers and classification based on the extracted features using fully connected layers. This core is specifically engineered to handle complex arithmetic calculations by multiplying input and weight tensors and then adding the output to a register file tensor.

The GEMM core of VTA is defined by key hardware parameters: BATCH, BLOCK_IN, and BLOCK_OUT. BATCH size refers to the number of data samples processed in parallel, BLOCK_IN represents the inner dimension of the input tensor, and BLOCK_OUT pertains to the output dimension of the weight tensor and the accumulated result.

In this paper, we dedicate one of our actor-critic MARL agents to optimize the VTA GEMM core for enhanced hardware performance. By selectively tuning BATCH, BLOCK_IN, and BLOCK_OUT—referred to herein as ”hardware knobs”—we adapt the GEMM core to be optimized in our method. Altering BATCH size impacts the parallelism of data processing while modifying BLOCK_IN and BLOCK_OUT influences the granularity of the computation and the utilization of the on-chip memory resources. The goal is to strike an optimal balance between computational throughput and resource allocation, thereby maximizing the GEMM core’s efficiency. This RL agent learns to predict the optimal hardware configuration by interacting with the VTA environment and other software agents, thereby learning from the empirical performance outcomes of previous configurations.

2.2. CTDE in MARL

Multi-agent reinforcement learning (MARL) extends traditional reinforcement learning, introducing multiple agents within a shared environment. The complexity of agent interactions calls for sophisticated methodologies that coordinate individual agent strategies with collective objectives. The Centralized Training with Decentralized Execution (CTDE) framework has emerged as a significant paradigm in MARL, balancing the collective intelligence during training with the autonomy of individual agents during operation (Lyu et al., 2021).

During the centralized training phase within the CTDE paradigm, agents have access to a global state and learn policies that consider all agents’ collective state and actions. This centralized approach allows agents to develop cohesive strategies, effectively addressing cooperative, collaborative, competitive, and mixed tasks. The essence of CTDE is encapsulated in the adaptation of Proximal Policy Optimization (PPO) (Schulman et al., 2017) to MARL, leading to the development of Multi-Agent PPO (MAPPO) (Yu et al., 2022).

In the context of MAPPO, the learning process involves three pivotal components: critic learning, General Advantage Estimation (GAE), and policy learning. These components are encapsulated in the following mathematical expressions:

•

Critic learning ensures that each iteration yields an enhanced centralized value function. It is defined as:

(1)

\phi_{k+1}=\arg\min_{\phi}\frac{1}{|D_{k}|T}\sum_{\tau\in D_{k}}\sum_{t=0}^{T}% \left(V_{\phi}(\mathbf{o}{t},\mathbf{s}{t},\mathbf{u}{t})-\hat{R}{t}\right)^{2}

•

General Advantage Estimation (GAE) gauges the quality of the current action relative to the baseline critic value, described by:

(2)

A_{t}=\sum_{t=0}^{\infty}(\gamma\lambda)^{t}\delta^{V}_{t+1}

•

Policy learning involves computing the policy gradient using the estimated advantage to update the policy function:

(3)		$\displaystyle L(\mathbf{o},\mathbf{s},\mathbf{u},\mathbf{u}^{\prime},\theta_{k% },\theta)=$	$\displaystyle\min\bigg{(}\frac{\pi_{\theta}(\mathbf{u}\|\mathbf{o})}{\pi_{% \theta_{k}}}(\mathbf{u}\|\mathbf{o})A^{\pi_{\theta_{k}}}(\mathbf{o},\mathbf{s},% \mathbf{u}^{\prime}),$
(3)			$\displaystyle\text{clip}\left(\frac{\pi_{\theta}(\mathbf{u}\|\mathbf{o})}{\pi_{% \theta_{k}}},1-\epsilon,1+\epsilon\right)A^{\pi_{\theta_{k}}}(\mathbf{o},% \mathbf{s},\mathbf{u}^{\prime})\bigg{)}$

where $D$ represents the trajectories that can be shared across agents, $\hat{R}$ is the estimated reward, $\tau$ denotes a trajectory, and $A$ is the advantage. The discount factor $\gamma$ and the weight of the GAE $\lambda$ govern the temporal scope of the advantage estimation. In these expressions, $\mathbf{u}$ denotes the current agent action, while $\mathbf{u}^{\prime}$ encapsulates the action set of all agents except the acting agent. The global state is represented by $\mathbf{s}$ , and $\mathbf{o}$ signifies the local observation of an agent. The hyperparameter $\epsilon$ controls the policy update’s deviation from the previous policy. $V_{\phi}$ is the shared value function, and $\pi_{\theta}$ denotes the shared policy function, highlighting the collaborative essence of CTDE in MARL.

Empirical studies reveal that MAPPO can rival and, at times, outperform competitive off-policy algorithms, which have traditionally been preferred for their sample efficiency (Yu et al., 2022). These findings led us to use MAPPO as our optimization algorithm used in MARL.

2.3. Workflow for DNN compilers

Mapping DNN models into efficient machine code is crucial for their performance across various hardware platforms. This mapping is managed by a compiler workflow that optimizes the DNN model in several stages (Allen and Kennedy, 2002). Initially, the compiler’s frontend conducts general optimizations. These broad improvements do not yet account for the specific hardware where the model will run. Subsequently, backend optimizations consider the target hardware’s characteristics, shaping the code in a hardware-aware manner. Recent advancements like the AutoTVM system have introduced an innovative step that fine-tunes the code based on actual hardware performance feedback. This method explores various settings—referred to as ”knobs”—to determine the most effective configuration for the code to run efficiently. The search for this optimal configuration ( $\Theta^{*}$ ) is captured by:

\Theta^{*}=\underset{\Theta\in D}{\text{argmax}}\left\{f\left[\tau(\Theta)% \right]\right\},

where the term $f$ represents the performance or ’fitness’ of the code, and $\Theta$ stands for a specific set of knob settings. The compiler’s goal is to discover the knob settings that maximize the code’s performance on the target hardware within the allowed configuration space $D$ .

2.4. Related Work

In the domain of DNN compilers, adaptive and efficient auto-tuning mechanisms have proven critical for optimizing performance on varying hardware architectures (Rieber et al., 2022; Ryu et al., 2022; Lu et al., 2022; Fayyazi et al., 2024; Dhakal et al., 2022). An interesting development in this field is AutoTVM (Chen et al., 2018), which utilizes a machine learning-based cost model, specifically XGBoost (Chen and Guestrin, 2016), for its auto-tuning process. This method, while effective, involves a large search space and substantial measurement overhead, making optimization resource-intensive.

MetaTune (Ryu and Sung, 2021) was introduced to address the limitations of black-box auto-tuning by leveraging meta-learning to allow its cost model to ”learn to learn” performance correlations. This enables MetaTune (Ryu and Sung, 2021) to adapt to new optimization spaces quickly and efficiently, demonstrating improvements in inference times compared to prior methods. Glimpse (Ahn et al., 2022) builds upon these approaches by integrating mathematical embeddings of hardware specifications, known as Blueprints, in a Bayesian optimization framework. These Blueprints guide the search algorithm towards subspaces with higher performance potential.

Another approach is CHAMELEON(Ahn et al., 2020), which employs reinforcement learning for its Adaptive Exploration and Sampling algorithms, minimizing invalid configurations and costly hardware measurements. CHAMELEON(Ahn et al., 2020) significantly reduces optimization time compared to AutoTVM while improving inference times. NaaS (Neural Architecture and Accelerator Search) (Zhou et al., 2022) expands upon these frameworks by jointly optimizing neural network architectures and hardware accelerators. It parameterizes both architectures in a unified search space using PyGlove and demonstrates efficiency gains across tasks like image classification and segmentation.

PRIME (Data-Driven Offline Optimization for Architecting Hardware Accelerators) (Kumar et al., 2021) introduces a novel data-driven method for designing hardware accelerators using logged simulation data to construct a robust, conservative surrogate model. By eliminating simulation-driven optimization’s repeated overhead, PRIME significantly reduces simulation time while architecting specialized accelerators for single- and multi-application tasks.

Comparing these methods highlights AutoTVM’s groundwork in machine learning-based auto-tuning. MetaTune and Glimpse advance this by incorporating meta-learning and hardware-aware strategies, respectively. CHAMELEON extends the reinforcement learning approach with improved sampling methods. NaaS and PRIME offer new dimensions by integrating accelerator co-design and offline data-driven optimization. However, despite the advances brought by NaaS and PRIME in hardware consideration, their lack of reinforcement learning makes them slower in compilation times, as they can’t rapidly navigate the expansive hardware and software search space. Reinforcement learning algorithms, as used in CHAMELEON, enable compilers to navigate possible configurations efficiently, identifying high-performance options with fewer tests. This results in a more agile optimization process, allowing for quicker deployment of DNN models and facilitating rapid progress in deep learning research and applications(Zhang et al., 2022; Wang et al., 2022; Bakshi and Johnsson, 2023; Ahn et al., 2020).

In comparing our approach to existing methods like CHAMELEON, Glimpse, and MetaTune—which focus on reducing compilation times through different optimization techniques—and co-design methods like NaaS and PRIME—which strive to enhance system throughput through various optimization strategies—our analysis underscores the distinct advantages of MARL in efficiently navigating the breadth and depth of the design space in modern DNN applications. While effective in their own ways, these methods often fail to fully capture the interdependencies between hardware and software optimizations that our framework does, which can result in suboptimal performance and inefficient throughput in real-world applications. Additionally, co-design optimizers like NaaS and PRIME, though aiming to streamline throughput, lack the capacity to effectively reduce optimization time within the vast hardware-software search space and require significant resources for compiling designs.

3. ARCO

As explained before, current systems fall short of effectively optimizing both the hardware and software components of DNNs simultaneously, as they tend to prioritize software parameters exclusively. Also, existing co-design approaches are too time-consuming to compile. To break free from prolonged optimization processes and fragmented tuning methods, it is important to take on two key challenges: 1) Improving the search algorithm to encompass hardware architecture and parameters as modifiable factors and 2) Refining the sampling process to capture the solution distribution better and minimize encountering non-feasible configurations.

We propose two key advancements in the co-optimizing compiler for DNNs to address these obstacles. Firstly, we incorporate a multi-agent reinforcement learning strategy into the search algorithm, allowing it to adapt to new design environments while simultaneously refining hardware and software parameters. This approach harnesses collective intelligence to achieve a more holistic and efficient optimization approach. Secondly, we introduce a novel Confidence Sampling (CS) method to replace the existing uniform (Chen et al., 2018) and adaptive sampling methods (Ahn et al., 2020), which often are random and rely on trial and error. The proposed CS method identifies configurations with a higher probability of success, guided by learned patterns from previous successful optimizations.

3.1. Overview of ARCO

Figure 2 depicts the architecture of our sophisticated co-optimizing compiler, termed ARCO, which elucidates the process of optimization compilation. ARCO initiates with a code template, $\tau$ , for each layer of the neural network and a corresponding design space $D_{\Theta}$ , delineating the range of possible configurations. The compiler then engages in an optimization process, navigating through the configuration space $\Theta$ to ascertain the optimal code configuration, represented as $\tau(\Theta^{*})$ .

ARCO maneuvers through the design space, employing a cost model as a surrogate for direct hardware measurements. The MARL Exploration module generates an initial set of candidate configurations $S_{\Theta}$ . Through the CS method, ARCO refines $S_{\Theta}$ to produce a more focused subset $S^{\prime}_{\Theta}$ , which contains a condensed yet highly promising set of configurations. The configurations in $S^{\prime}_{\Theta}$ are then passed to the MARL Code Generator module, which incorporates them with the input template $\tau$ to create a series of potential executable codes $\tau(\Theta)$ . These are subsequently deployed on the hardware for empirical runtime evaluations. The hardware runtimes provide a measure of the configurations’ fitness, captured by a fitness function $f$ , which informs the update of the cost model and enhances the exploration in subsequent iterations. After several iterations, the process converges to some $\tau(\Theta^{*})$ that achieves optimal fitness, characterized by the shortest hardware execution runtime. This configuration $(\Theta^{*})$ is then selected as the output for that network layer.

3.2. MARL Exploration

In our work, we implement a MARL Exploration module (As shown in Figure 3) that employs a Multi-Agent Reinforcement Learning (MARL) strategy designed to optimize the configurations of DNN architectures and hardware simultaneously. This module utilizes the principle of Centralized Training and Decentralized Execution (CTDE), allowing agents to learn collaboratively while operating independently during the execution phase to maximize the fitness function, $f$ , of configurations within the configuration space, $S_{\Theta}$ . MARL’s ability to manage complex decision-making environments makes it particularly suitable for navigating the high-dimensional configuration space of DNNs.

The MARL Exploration strategy employs three agents, each equipped with a policy network, as detailed in Table 1. Additionally, there is a centralized critic network (value network) that, along with the policy networks, is implemented as a Multi-Layer Perceptron (MLP) neural network. The policy network directs each agent to propose adjustments to the configuration knobs (as shown in Table 2) within its assigned portion of the design space. In contrast, the value network estimates the potential value of these adjustments. A shared central critic aids in evaluating the global state, facilitating collective learning during the training phase, while each agent makes decisions independently during execution, adhering to the CTDE paradigm.

Table 1. Roles and Responsibilities of Different Agents

Agent Type

Description

Software Agent 1

(Scheduling Agent)

This agent focuses on how tasks are parallelized and distributed, which is crucial for effective scheduling. Scheduling tasks typically involve deciding how computations are assigned and executed in parallel threads.

Software Agent 2

(Mapping Agent)

This agent deals with how data (namely, tensor height and width) is divided and processed. The focus on specific data dimensions makes this agent suitable for determining how computations map onto the hardware, dealing with spatial data distribution.

Hardware Agent

This agent adjusts parameters that directly influence how different components of the tensors (batches, input channels, and output channels) are divided and processed by the hardware. This suggests a deep focus on optimizing the hardware utilization based on the hardware’s capabilities and limitations (namely, resource counts and performance levels.)

The optimization unfolds through several episodes, each consisting of multiple search steps. During these steps, the agents evaluate the current configuration and independently decide on actions to improve subsequent configuration fitness, guided by shared insights and individual observations. The process detailed in Algorithm 1 outlines how each agent interacts within the CTDE framework, iterating through configurations to enhance the overall network performance.

Algorithm 1 CTDE MARL Exploration for DNN Configuration Optimization

1:Initialize centralized critic and individual policy networks for each agent

2:for each episode do

3: Initialize a set of configurations,

S_{\Theta}

4: for each search step in the episode do

5: for each agent do

6: Observe the current configuration,

\Theta

, and receive shared insights from the centralized critic

7: Independently chooses an action based on the policy network and local observations

8: Apply the action to update the configuration

9: Estimate the new configuration’s local value

10: end for

11: Collectively evaluate all new configurations using the cost model

12: Update the centralized critic based on global performance feedback

13: Individually update each agent’s policy network based on local and shared feedback

14: end for

15: Determine the configuration with the highest fitness from collective insights

16:end for

17:Return the optimal configuration,

\tau(\Theta^{*})

Each episode integrates the evaluations from a cost model that serves as a surrogate for direct hardware performance measurements. This model’s feedback is used to update the centralized critic, enhance the shared knowledge base, and refine the individual policy networks, refining each agent’s decision-making capabilities based on global and local feedback. As the episodes progress, the MARL Exploration Module becomes increasingly adept at identifying configurations that yield the best performance, ultimately converging on $\tau(\Theta^{*})$ with the optimal fitness, $f$ .

Applying this CTDE framework within the MARL context, coupled with the PPO optimization algorithm, represents a significant advancement in the auto-tuning of DNN and hardware configurations. This approach systematically improves the configuration search process, significantly enhancing DNN performance while reducing the computational overhead.

3.2.1. Cost Model and Central Critic Update

In the ARCO framework, the optimization goal is to improve throughput, with the cost model reflecting the inverse of execution time. The central critic, a key component of our MARL setup, updates its policy based on the aggregated feedback from all agents, continuously optimizing the global state evaluation. The critic’s learning process is guided by the update rule, which is mathematically expressed in section 2.2, which minimizes the mean squared error between the predicted values and the actual rewards obtained.

3.2.2. Incorporating Hardware and Software Constraints

To integrate constraints related to hardware, such as area limitations, or software, such as memory usage, into the MARL framework, a penalty term can be incorporated into the reward function. This approach adjusts the reward based on the degree to which the constraints are violated, effectively guiding the agents to prefer configurations that adhere to these constraints. For instance, a penalty function $P$ can be defined as follows:

	$\displaystyle P(\Theta)=$	$\displaystyle\lambda\left(\max(0,\text{area}(\Theta)-\text{area}_{\text{max}})\right.$
(4)			$\displaystyle\left.+\max(0,\text{memory}(\Theta)-\text{memory}_{\text{max}})\right)$

where $\lambda$ is a scaling factor that adjusts the impact of the penalty, $\text{area}(\Theta)$ and $\text{memory}(\Theta)$ are the area and memory usage of the configuration $\Theta$ , and $\text{area}_{\text{max}}$ and $\text{memory}_{\text{max}}$ are the respective maximums.

The reward function, modified to include this penalty, becomes:

(5)

R_{t}=\frac{1}{\text{execution time}(\Theta)}-P(\Theta)

This modification ensures that the MARL agents not only seek to optimize performance but also adhere to specified design constraints, balancing efficiency with practical deployment considerations.

Table 2. Knobs in the design space to optimize convolution layers (These knobs make a search space as big as O(

2^{12}

))

Agent Type	Knobs
Hardware Design Agent (Hardware Optimizer)	tile_b, tile_ci, tile_co
- Tile across batch size (tile_b)
- Tile across input channels (tile_ci)
- Tile across output channels (tile_co)
Scheduling Agent (Software Optimizer)	h_threading, oc_threading
- Horizontal threading (h_threading)
- Output channel threading (oc_threading)
Mapping Agent (Software Optimizer)	tile_h, tile_w
- Tile across height (tile_h)
- Tile across width (tile_w)

3.3. Confidence Sampling

In the domain of DNN optimization, the ability to efficiently sample configuration spaces without costly hardware measurements is essential for rapid compilation. We introduce the Confidence Sampling (CS) method. This technique is designed to leverage the probability distribution of configurations’ fitness judiciously to select a subset likely to yield high-performing network configurations.

Although the Confidence Sampling method operates on the principles of probability-guided selection and shares some conceptual similarities with Importance Sampling (IS) (Tokdar and Kass, 2010) in statistical methods, it is distinct in its purpose and application. In IS, samples are re-weighted based on their probability density to approximate expected values more efficiently. Confidence Sampling, however, specifically focuses on reducing the number of hardware measurements required when exploring the design space. By evaluating the configurations’ values, this method synthesizes a subset that not only encapsulates the diversity of the configuration space but also emphasizes the high-confidence regions, i.e., areas where the likelihood of encountering an optimal or near-optimal configuration is high.

The core of the CS method lies in its algorithmic approach to filtering and enhancing the set of configurations for subsequent optimization iterations. To formalize this process, we introduce Algorithm 2, which captures the essence of the CS method within the optimization pipeline. The process involves the following steps:

(1)

Evaluate Configurations: Each configuration’s value is estimated using the output of the value network (critic network) to construct a probability distribution over the configuration space. (Line 2)
(2)

Probability-Guided Selection: Configurations are sampled based on this distribution, prioritizing those with higher estimated values. (Line 3-4)
(3)

Confidence Assessment: A dynamic thresholding mechanism is applied to determine the confidence level of each selected configuration. (Line 5 )
(4)

Synthesis of Configurations: Lower-confidence configurations are replaced with synthesized ones by combining each parameter’s most frequently occurring settings across the sampled configurations. (Line 6-7)

Algorithm 2 Confidence Sampling for DNN Optimization

1:procedure ConfidenceSampling(

\mathcal{S}_{\Theta}

value\_network

N_{select}

)

\mathcal{V}_{preds}\leftarrow value\_network.predict(\mathcal{S}_{\Theta})

\triangleright

Estimate values for all configurations

\mathcal{P}\leftarrow softmax(\mathcal{V}_{preds})

\triangleright

Convert values to a probability distribution

\mathcal{C}_{selected}\leftarrow\textsc{SelectConfigurations}(\mathcal{P},N_{% configs})

threshold\leftarrow\textsc{ComputeDynamicThreshold}(\mathcal{V}_{preds})

\triangleright

Compute an adaptive threshold

\mathcal{C}_{high\_conf}\leftarrow\{c\in\mathcal{C}_{selected}\mid\mathcal{V}_% {preds}[c]>threshold\}

7: return

\mathcal{C}_{high\_conf}

8:end procedure

9:function SelectConfigurations(

\mathcal{P}

N_{configs}

)

10:

indices\leftarrow

sample indices from

[1,\ldots,|\mathcal{P}|]

with probability

\mathcal{P}

11: return

\mathcal{S}_{\Theta}[indices[1:N_{configs}]]

12:end function

13:function ComputeDynamicThreshold(

\mathcal{V}_{preds}

)

14: return

median(\mathcal{V}_{preds})

15:end function

As part of the larger system, the CS method interacts closely with the exploration and exploitation mechanisms to ensure that the exploration is directed toward the most promising regions of the design space. By applying the CS method, we have observed a reduced number of required configurations, as shown in Figure 4. Moreover, we can observe the sampling process gravitates towards configurations that demonstrate superior performance over time, highlighting the efficacy of this method.

4. Experimental Results

4.1. Experimental Setup

We have incorporated ARCO within the TVM framework to thoroughly evaluate its components, contrasting its performance with that of AutoTVM and CHAMELEON. This integration allows for a comprehensive, end-to-end assessment of ARCO across a suite of advanced deep-learning models, including AlexNet, VGG-11, VGG-13, VGG-16, VGG-19, ResNet-18, and ResNet-34. All models were benchmarked using the same AutoTVM compilation duration to ensure a rigorous and equitable comparison. Further, we have employed the source code of the CHAMELEON (Ahn et al., 2020) to compare its efficacy with AutoTVM and ARCO. As mentioned above, AutoTVM and CHAMELEON do not support hardware configuration exploration; thus, in the case of these two exploration techniques, we have considered the default specification values of the VTA++ (Banerjee et al., 2021). Performance tests were conducted on a robust platform powered by a 3.4 GHz AMD EPYC 7763 64-Core Processor, providing a stable and powerful environment for these evaluations. As discussed before, we used the VTA++ simulator as our target hardware.

Table 3 shows the details of the DNN models employed for the comparison studies. In our setup, each agent is equipped with its own policy network, and a centralized value network functions as the critic for all agents. Both networks are implemented as Multi-Layer Perceptrons (MLPs) using TensorFlow (Abadi et al., 2016) that their details are:

•

Policy Network: Each agent’s policy network consists of a single hidden layer with 20 neurons using ReLU activation functions. The output layer uses softmax activation to generate a probability distribution over actions, allowing for decision-making informed by learned policies.
•

Value Network: The centralized critic, or value network, employs three hidden layers with 20 neurons each, using tanh activation functions. This design helps stabilize learning by providing a continuous estimate of state values, which guides collective policy adjustments.

Table 3. Details of the DNN models used in evaluating ARCO (All of the models are extracted from Mxnet (Chen et al., 2015)).

Network	Dataset	Number of Convolution Tasks
AlexNet(Krizhevsky et al., 2012)	ImageNet	5
VGG-11(Simonyan and Zisserman, 2014)	ImageNet	8
VGG-13(Simonyan and Zisserman, 2014)	ImageNet	10
VGG-16(Simonyan and Zisserman, 2014)	ImageNet	13
VGG-19(Simonyan and Zisserman, 2014)	ImageNet	16
ResNet-18(He et al., 2016)	ImageNet	17
ResNet-34(He et al., 2016)	ImageNet	33

Hyper-parameter tuning is crucial in optimizing the performance of machine learning tools and models (Yu and Zhu, 2020). To facilitate transparency and reproducibility, we have detailed the hyper-parameters utilized in our evaluation in Table 4. These hyper-parameters were meticulously tuned offline to optimize model performance, and they are the same as those used in CHAMELEON (Ahn et al., 2020). For the hyper-parameters listed in Table 5, we adhered to the values specified in the AutoTVM study to maintain consistency and ensure a fair comparison. Similarly, we adopted the hyper-parameter settings from the MAPPO paper (Yu et al., 2022) for the MARL exploration module, aligning our methodologies with established research to validate our findings effectively.

Table 4. Hyper-parameters (HPs) used in ARCO.

Value

Description

iteration_opt

Total iterations for the optimization cycle

(equivalent to 1000 hardware measurements)

modeGBT

xgb-reg

Loss function type utilized in the cost model

bGBT

Planning’s maximum batch size in GBT (Chen et al., 2018)

episode_rl

128

Number of episodes for the RL process

step_rl

500

Maximum # of steps in a single RL episode

Table 5. Hyper-parameters (HPs) used in AutoTVM

HP	Value	Description
$\Sigma(b_{GBT})$	1000	Total count of hardware measurements
$mode_{GBT}$	xgb-reg	Loss function type for the cost model
$b_{GBT}$	64	Batch size for planning in GBT (Chen et al., 2018)
$n_{sa}$	128	Count of Markov chains in parallel SA^∗
$step_{sa}$	500	Maximum # of steps in a single SA run

^∗

SA: Simulated Annealing.

4.2. End-to-end Evaluation

Figure 5 Compares the achieved throughput of different frameworks over AutoTVM. As the results illustrate, the significant reduction in mean inference time (advancement in throughput) achieved by ARCO compared to AutoTVM and CHAMELEON for various DNN models: AlexNet, VGG-11, VGG-13, VGG-16, VGG-19, and ResNet-18, and ResNet-34. On average, our methodology demonstrates a 1.17× improvement in throughput. Table 6 reports detailed numerical results supporting these findings.

Table 6. Mean inference times for different frameworks on VTA++ (in seconds)

Model	AutoTVM	CHAMELEON	ARCO
ResNet-18	$1.730\,61$	$1.595\,95$	1.25448
ResNet-34	$3.634\,09$	$3.537\,80$	2.66561
AlexNet	$0.541\,16$	$0.589\,81$	0.42160
VGG11	$4.952\,29$	$4.212\,83$	4.20685
VGG13	$6.192\,89$	$6.163\,50$	6.03418
VGG16	$8.558\,78$	$8.730\,89$	8.34791
VGG19	$11.291\,50$	$11.288\,39$	11.27800

While achieving improvement in throughput, ARCO also needs less compilation time compared to AutoTVM; Figure 6 shows that ARCO speeds up the optimization time up to 42.2%. Figure 7 shows ARCO’s comparative output code performance relative to other frameworks. Notably, ARCO achieves convergence to the peak GFLOPS value, equivalent to those reached by AutoTVM and CHAMELEON (without adaptive sampling), but with greater efficiency. This is accomplished with fewer hardware measurements and at a faster rate, underscoring the efficacy of the CS method.

The overall efficiency gains are primarily attributable to the sophisticated MARL exploration techniques and the reduced hardware measurements necessitated by implementing the CS method.

5. Conclusion

This paper introduced ARCO, an innovative multi-agent reinforcement learning-based compiler that significantly advances the field of DNN accelerator design. By integrating a multi-agent system with specialized roles—two agents focused on software optimization and one on hardware—ARCO addressed the intricacies of DNN architecture and effectively navigated the vast configuration space to enhance performance and efficiency. Employing the MARL Exploration module and Confidence Sampling method, ARCO reduced the computational overhead and expedited the optimization process while increasing the throughput. Our experimental results, conducted on advanced DNN models like AlexNet, VGG, and ResNet series, demonstrated that ARCO outperforms traditional frameworks like AutoTVM and CHAMELEON, improving throughput by up to 37.95% and reducing optimization time significantly (Up to 42.2%). These enhancements underscore the potential of adaptive, intelligent systems in optimizing hardware and software layers cohesively, setting a new benchmark for future developments in real-time DNN applications.

References

(1)
Abadi et al. (2016) Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. $\{$ TensorFlow $\}$ : a system for $\{$ Large-Scale $\}$ machine learning. In 12th USENIX symposium on operating systems design and implementation (OSDI 16). 265–283.
Ahn et al. (2022) Byung Hoon Ahn, Sean Kinzer, and Hadi Esmaeilzadeh. 2022. Glimpse: mathematical embedding of hardware specification for neural compilation. In Proceedings of the 59th ACM/IEEE Design Automation Conference. 1165–1170.
Ahn et al. (2020) Byung Hoon Ahn, Prannoy Pilligundla, Amir Yazdanbakhsh, and Hadi Esmaeilzadeh. 2020. Chameleon: Adaptive code optimization for expedited deep neural network compilation. arXiv preprint arXiv:2001.08743 (2020).
Allen and Kennedy (2002) Randy Allen and Ken Kennedy. 2002. Optimizing compilers for modern architectures: a dependence-baced approach. (No Title) (2002).
Bakshi and Johnsson (2023) Suyash Bakshi and Lennart Johnsson. 2023. Computationally Efficient DNN Mapping Search Heuristic using Deep Reinforcement Learning. ACM Transactions on Embedded Computing Systems 22, 5s (2023), 1–21.
Banerjee et al. (2021) Suvadeep Banerjee, Steve Burns, Pasquale Cocchini, Abhijit Davare, Shweta Jain, Desmond Kirkpatrick, Anton Sorokin, Jin Yang, and Zhenkun Yang. 2021. A highly configurable hardware/Software stack for DNN inference acceleration. arXiv preprint arXiv:2111.15024 (2021).
Chen and Guestrin (2016) Tianqi Chen and Carlos Guestrin. 2016. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining. 785–794.
Chen et al. (2015) Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. 2015. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274 (2015).
Chen et al. (2018) Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, et al. 2018. $\{$ TVM $\}$ : An automated $\{$ End-to-End $\}$ optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). 578–594.
Dhakal et al. (2022) Aditya Dhakal, KK Ramakrishnan, Sameer G Kulkarni, Puneet Sharma, and Junguk Cho. 2022. Slice-tune: A system for high performance dnn autotuning. In Proceedings of the 23rd ACM/IFIP International Middleware Conference. 228–240.
Du and Ding (2021) Wei Du and Shifei Ding. 2021. A survey on multi-agent deep reinforcement learning: from the perspective of challenges and applications. Artificial Intelligence Review 54, 5 (2021), 3215–3238.
Fayyazi et al. (2024) Arash Fayyazi, Mahdi Nazemi, Arya Fayyazi, and Massoud Pedram. 2024. NeuroBlend: Towards Low-Power yet Accurate Neural Network-Based Inference Engine Blending Binary and Fixed-Point Convolutions. arXiv:2307.03784 [cs.AR]
He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
Hossain et al. (2023) Md Bipul Hossain, Na Gong, and Mohamed Shaban. 2023. Computational Complexity Reduction Techniques for Deep Neural Networks: A Survey. In 2023 IEEE International Conference on Artificial Intelligence, Blockchain, and Internet of Things (AIBThings). IEEE, 1–6.
Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems 25 (2012).
Kumar et al. (2021) Aviral Kumar, Amir Yazdanbakhsh, Milad Hashemi, Kevin Swersky, and Sergey Levine. 2021. Data-driven offline optimization for architecting hardware accelerators. arXiv preprint arXiv:2110.11346 (2021).
Lu et al. (2022) Bingqian Lu, Zheyu Yan, Yiyu Shi, and Shaolei Ren. 2022. A Semi-Decoupled Approach to Fast and Optimal Hardware-Software Co-Design of Neural Accelerators. arXiv preprint arXiv:2203.13921 (2022).
Lyu et al. (2021) Xueguang Lyu, Yuchen Xiao, Brett Daley, and Christopher Amato. 2021. Contrasting centralized and decentralized critics in multi-agent reinforcement learning. arXiv preprint arXiv:2102.04402 (2021).
Mirjalili and Mirjalili (2019) Seyedali Mirjalili and Seyedali Mirjalili. 2019. Genetic algorithm. Evolutionary algorithms and neural networks: Theory and applications (2019), 43–55.
Moreau et al. (2019) Thierry Moreau, Tianqi Chen, Luis Vega, Jared Roesch, Eddie Yan, Lianmin Zheng, Josh Fromm, Ziheng Jiang, Luis Ceze, Carlos Guestrin, et al. 2019. A hardware–software blueprint for flexible deep learning specialization. IEEE Micro 39, 5 (2019), 8–16.
Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 (2019).
Rieber et al. (2022) Dennis Rieber, Moritz Reiber, Oliver Bringmann, and Holger Fröning. 2022. Hw-aware initialization of dnn auto-tuning to improve exploration time and robustness. arXiv preprint arXiv:2205.15568 (2022).
Ryu et al. (2022) Jaehun Ryu, Eunhyeok Park, and Hyojin Sung. 2022. One-shot tuner for deep learning compilers. In Proceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction. 89–103.
Ryu and Sung (2021) Jaehun Ryu and Hyojin Sung. 2021. Metatune: Meta-learning based cost model for fast and efficient auto-tuning frameworks. arXiv preprint arXiv:2102.04199 (2021).
Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017).
Simonyan and Zisserman (2014) Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
Tokdar and Kass (2010) Surya T Tokdar and Robert E Kass. 2010. Importance sampling: a review. Wiley Interdisciplinary Reviews: Computational Statistics 2, 1 (2010), 54–60.
Vasilache et al. (2018) Nicolas Vasilache, Oleksandr Zinenko, Theodoros Theodoridis, Priya Goyal, Zachary DeVito, William S Moses, Sven Verdoolaege, Andrew Adams, and Albert Cohen. 2018. Tensor comprehensions: Framework-agnostic high-performance machine learning abstractions. arXiv preprint arXiv:1802.04730 (2018).
Wang et al. (2022) Huanting Wang, Zhanyong Tang, Cheng Zhang, Jiaqi Zhao, Chris Cummins, Hugh Leather, and Zheng Wang. 2022. Automating reinforcement learning architecture design for code optimization. In Proceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction. 129–143.
Yu et al. (2022) Chao Yu, Akash Velu, Eugene Vinitsky, Jiaxuan Gao, Yu Wang, Alexandre Bayen, and Yi Wu. 2022. The surprising effectiveness of ppo in cooperative multi-agent games. Advances in Neural Information Processing Systems 35 (2022), 24611–24624.
Yu and Zhu (2020) Tong Yu and Hong Zhu. 2020. Hyper-parameter optimization: A review of algorithms and applications. arXiv preprint arXiv:2003.05689 (2020).
Zhang et al. (2022) Zining Zhang, Bingsheng He, and Zhenjie Zhang. 2022. HARL: Hierarchical Adaptive Reinforcement Learning Based Auto Scheduler for Neural Networks. In Proceedings of the 51st International Conference on Parallel Processing. 1–13.
Zhou et al. (2022) Yanqi Zhou, Xuanyi Dong, Tianjian Meng, Mingxing Tan, Berkin Akin, Daiyi Peng, Amir Yazdanbakhsh, Da Huang, Ravi Narayanaswami, and James Laudon. 2022. Towards the co-design of neural networks and accelerators. Proceedings of Machine Learning and Systems 4 (2022), 141–152.