
How attention offloading reduces the costs of LLM inference at scale

Credit: VentureBeat made with OpenAI DALL-E 3



Rearranging the computations and hardware used to serve large language models (LLMs) can considerably reduce the costs of inference, according to a new study by researchers at Tsinghua University. The study introduces “attention offloading,” a technique that uses lower-priced GPUs to handle memory-intensive operations while reserving the more expensive, compute-optimized accelerators for other tasks.

With high-end AI accelerators being expensive, scarce, and in high demand, techniques such as attention offloading can help companies make better use of their available hardware when serving LLMs at scale.

Two types of computations

LLM inference is a complicated process that involves different types of operations. The key to optimizing inference is to arrange these operations in a way that makes the best use of the memory and compute resources of the hardware accelerators.

From a resource perspective, the operations that take place during inference fall into two main categories. Some are compute-bound and benefit from faster accelerators such as Nvidia's A100 and H100. Others are memory-bound, meaning they are limited by video RAM (VRAM) capacity and bandwidth rather than raw compute power. This is particularly true of the self-attention operation that runs for each new token the model generates.

“This memory-bound workload disagrees with the inherent strengths of modern accelerators, resulting in a scenario where the memory controllers are overwhelmed while the powerful computation cores remain idle,” the researchers write.

The mismatch between memory and compute resources becomes more pronounced as sequences grow longer, for example when users write longer prompts or hold extended conversations with the model, a common scenario in real-world applications.
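A back-of-the-envelope calculation illustrates the mismatch. The sketch below uses illustrative numbers for a 13B-class model and publicly listed A100 specs (these figures are assumptions chosen for illustration, not taken from the paper) to compare how much arithmetic decode-time attention performs per byte of KV cache it reads against what a compute-optimized accelerator can sustain:

```python
# Back-of-the-envelope arithmetic intensity of decode-time self-attention.
# Illustrative numbers for a 13B-class model; exact figures vary by model.

HIDDEN = 5120          # model hidden size (assumed)
LAYERS = 40            # number of transformer layers (assumed)
BYTES_FP16 = 2
SEQ_LEN = 4096         # tokens already in the context

# Per new token, per layer: the KV cache for the whole context must be read.
kv_bytes_per_layer = 2 * SEQ_LEN * HIDDEN * BYTES_FP16   # keys + values
# FLOPs for QK^T and the attention-weighted sum over values (~2 multiply-adds each).
attn_flops_per_layer = 4 * SEQ_LEN * HIDDEN

bytes_total = LAYERS * kv_bytes_per_layer
flops_total = LAYERS * attn_flops_per_layer

print(f"KV cache read per token: {bytes_total / 1e9:.2f} GB")
print(f"Attention FLOPs per token: {flops_total / 1e9:.2f} GFLOPs")
print(f"Arithmetic intensity: {flops_total / bytes_total:.2f} FLOPs/byte")

# An A100 delivers roughly 312 TFLOPS (FP16) against roughly 2 TB/s of HBM
# bandwidth, a ratio of about 150 FLOPs/byte. Decode attention sits near
# 1 FLOP/byte, so the compute units mostly wait on memory.
print(f"A100 compute-to-bandwidth ratio: ~{312e12 / 2.0e12:.0f} FLOPs/byte")
```

Per generated token, attention performs roughly one floating-point operation for every byte of KV cache it reads, while the accelerator's compute-to-bandwidth ratio sits two orders of magnitude higher, which is why its cores sit idle waiting on memory.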

Attention offloading

Current solutions mostly focus on scaling homogeneous architectures built from high-end flagship accelerators. For example, companies purchase more and more H100 GPUs to build bigger inference clusters. This results in exploding costs and suboptimal use of hardware.

“Our research suggests that the unique characteristics of the LLM generation phase call for a heterogeneous architecture for better efficiency and lower cost,” the researchers write.

The study suggests that each type of accelerator is suited for particular aspects of LLM inference. For example, consumer-grade GPUs are very cost-effective and are suitable for memory-intensive operations. They can provide three times more memory capacity and bandwidth per dollar compared to high-end accelerators. However, given their limited compute power, solely relying on consumer-grade GPUs for serving LLMs can be inefficient. Companies will still need high-end accelerators.

The attention computation, however, is highly parallelizable, which means it can be distributed across multiple low-cost, memory-optimized devices.

“Attention offloading,” the technique proposed in the paper, involves creating two pools of accelerators, one optimized for computational power and the other for memory bandwidth efficiency. The attention computation is performed by low-cost, memory-efficient GPUs while the high-end accelerators are allocated to other operations.

Attention offloading architecture (source: arxiv)

“Adopting this heterogeneous architecture allows us to design a serving system that flexibly delivers the three essential components (i.e., computational power, memory capacity and bandwidth) for high-performance LLM inference in a cost-efficient manner,” the researchers write.

This architecture aligns the resource demands of different LLM inference operations with the strengths of different hardware. This way, you can spend your budget on a combination of compute- and memory-optimized accelerators, getting more memory and bandwidth than if you only purchased high-end accelerators.
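For a concrete picture of the split, here is a minimal PyTorch sketch of one decode step of a single transformer layer. It is not Lamina's implementation; the device names, dimensions and layer structure are stand-ins chosen for illustration. The weights and the compute-heavy projections live on the compute-optimized device, the KV cache and the attention operator live on the memory-optimized device, and only small per-token activations cross between them:

```python
# Minimal sketch of attention offloading for one decode step of one layer.
# Not Lamina's code: a hypothetical PyTorch illustration of the split, with
# the compute-optimized device holding the weights and the memory-optimized
# device holding the KV cache and running attention.
import torch
import torch.nn.functional as F

compute_dev = "cpu"   # stands in for a high-end accelerator (e.g. "cuda:0")
memory_dev = "cpu"    # stands in for a consumer GPU (e.g. "cuda:1")

HIDDEN, N_HEADS = 512, 8
HEAD_DIM = HIDDEN // N_HEADS

# Weights live on the compute-optimized device.
w_qkv = torch.randn(HIDDEN, 3 * HIDDEN, device=compute_dev) * 0.02
w_out = torch.randn(HIDDEN, HIDDEN, device=compute_dev) * 0.02
w_mlp = torch.randn(HIDDEN, 4 * HIDDEN, device=compute_dev) * 0.02
w_mlp_out = torch.randn(4 * HIDDEN, HIDDEN, device=compute_dev) * 0.02

# The KV cache lives on the memory-optimized device.
k_cache = torch.empty(0, N_HEADS, HEAD_DIM, device=memory_dev)
v_cache = torch.empty(0, N_HEADS, HEAD_DIM, device=memory_dev)

def decode_step(x):
    """One token's forward pass, with attention offloaded."""
    global k_cache, v_cache
    # 1) QKV projection on the compute device (a compute-bound matmul).
    q, k, v = (x @ w_qkv).split(HIDDEN, dim=-1)

    # 2) Ship only the new q/k/v activations to the memory device
    #    (a few kilobytes); the large KV cache never moves.
    q = q.view(N_HEADS, HEAD_DIM).to(memory_dev)
    k_cache = torch.cat([k_cache, k.view(1, N_HEADS, HEAD_DIM).to(memory_dev)])
    v_cache = torch.cat([v_cache, v.view(1, N_HEADS, HEAD_DIM).to(memory_dev)])

    # 3) Attention over the whole cache on the memory device (memory-bound).
    scores = torch.einsum("hd,thd->ht", q, k_cache) / HEAD_DIM ** 0.5
    attn = torch.einsum("ht,thd->hd", F.softmax(scores, dim=-1), v_cache)

    # 4) Send the small attention output back; the output projection and MLP
    #    run on the compute device (compute-bound matmuls).
    h = attn.reshape(HIDDEN).to(compute_dev) @ w_out
    return F.gelu(h @ w_mlp) @ w_mlp_out

x = torch.randn(HIDDEN)
for _ in range(4):              # generate a few tokens
    x = decode_step(x)
print("KV cache length:", k_cache.shape[0])
```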

The researchers explore different challenges of the heterogeneous architecture, including bandwidth requirements for interconnecting the two pools of accelerators.

“Our findings reveal that not only conventional system buses such as PCIe 4.0 could meet our needs, networking technologies like 200Gb Infiniband or even Ethernet, already widely deployed in current AI-oriented data centers nowadays, also suffice,” the researchers write.
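A rough estimate of what actually crosses the link helps explain why modest interconnects can suffice. Assuming only per-token activations move between the pools (the KV cache never leaves the memory devices), and using illustrative values for model size, batch size and decode rate that are not taken from the paper:

```python
# Rough estimate of the traffic that crosses the interconnect per decode step,
# assuming only per-token activations move (the KV cache stays on the memory
# pool). All numbers are illustrative, not measurements from the paper.

HIDDEN, LAYERS, BYTES_FP16 = 5120, 40, 2
BATCH = 256                     # sequences decoded together (assumed)
STEPS_PER_SEC = 20              # decode steps per second (assumed)

# Per layer and per sequence: q, k, v go out; the attention output comes back.
per_layer_per_seq = (3 + 1) * HIDDEN * BYTES_FP16
per_step = per_layer_per_seq * LAYERS * BATCH
print(f"Data moved per decode step: {per_step / 1e6:.1f} MB")

# Compare sustained traffic with common links: PCIe 4.0 x16 offers roughly
# 32 GB/s, a 200Gb InfiniBand link roughly 25 GB/s.
traffic = per_step * STEPS_PER_SEC
for name, bw in [("PCIe 4.0 x16", 32e9), ("200Gb InfiniBand", 25e9)]:
    print(f"{name}: {100 * traffic / bw:.1f}% of link bandwidth")
```

Even with hundreds of sequences decoded together, the sustained traffic in this rough estimate stays well within what a single PCIe 4.0 x16 link or a 200Gb InfiniBand connection can carry.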

They also use different scheduling and pipelining techniques to minimize the latency caused by the non-uniform architecture. Their system ensures that memory and compute resources are engaged simultaneously and not blocked by the sequential computations of a single inference batch.
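The staggering idea can be illustrated with a toy two-stage pipeline: while one micro-batch runs its attention on the memory pool, another runs its projections on the compute pool. The sketch below uses Python threads and sleep calls as stand-ins for the two stages; it mirrors the principle, not Lamina's actual scheduler:

```python
# Toy illustration of staggering two micro-batches so the compute pool and the
# memory pool stay busy at the same time. The stage functions are stand-ins.
import time
from concurrent.futures import ThreadPoolExecutor

def compute_stage(batch, step):
    time.sleep(0.05)   # stands in for QKV projection + MLP on the compute pool
    print(f"[compute pool] batch {batch} step {step}")

def attention_stage(batch, step):
    time.sleep(0.05)   # stands in for attention over the KV cache on the memory pool
    print(f"[memory pool ] batch {batch} step {step}")

compute_pool = ThreadPoolExecutor(max_workers=1)
memory_pool = ThreadPoolExecutor(max_workers=1)

# Two micro-batches in flight: while one batch's attention runs on the memory
# pool, the other batch's projections run on the compute pool.
pending_attention = {"A": None, "B": None}
for step in range(3):
    for batch in ("A", "B"):
        prev = pending_attention[batch]
        if prev is not None:
            prev.result()                  # this batch's previous attention must finish
        compute_pool.submit(compute_stage, batch, step).result()
        pending_attention[batch] = memory_pool.submit(attention_stage, batch, step)

for fut in pending_attention.values():
    fut.result()
```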

Lamina

According to their paper, the researchers developed Lamina, a distributed heterogeneous LLM inference system with attention offloading. 

Lamina uses consumer GPUs to store the key and value vectors computed for previous tokens, known as the “KV cache,” and to run the attention operator. It uses high-end accelerators to store the model parameters and run the other inference operations. These devices can be co-located within the same physical machine or distributed across a cluster of nodes.

By offloading KV cache storage and attention computation to the memory devices, Lamina can handle 10.7X–64X larger batches than vLLM, a popular LLM serving platform. This helps Lamina make better use of the expensive compute-optimized accelerators, especially when serving LLMs at very large scale with many concurrent requests.
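Some illustrative arithmetic shows where the headroom comes from. Using assumed numbers for a 13B-class model in fp16 and typical card capacities (none of these figures come from the paper), the KV cache for a single long sequence runs into gigabytes, so a high-end card that also holds the model weights fits only a handful of sequences, while a pool of consumer GPUs dedicated to the cache fits far more:

```python
# Illustrative KV cache sizing for a 13B-class model in fp16, showing why
# moving the cache to a pool of consumer GPUs allows much larger batches.
# All hardware and model numbers are rough assumptions, not measurements.

LAYERS, HIDDEN, BYTES_FP16 = 40, 5120, 2
CONTEXT = 4096                                 # tokens per sequence

kv_per_token = 2 * LAYERS * HIDDEN * BYTES_FP16   # keys + values, all layers
kv_per_seq = kv_per_token * CONTEXT
print(f"KV cache per sequence: {kv_per_seq / 1e9:.2f} GB")

# High-end accelerator holding both weights and cache (e.g. one 80 GB card,
# with roughly 26 GB of fp16 weights for a 13B model).
free_highend = 80e9 - 26e9
print(f"Sequences per 80 GB card: {int(free_highend // kv_per_seq)}")

# Memory pool of eight 24 GB consumer GPUs dedicated to the KV cache.
pool = 8 * 24e9
print(f"Sequences in an 8x24 GB memory pool: {int(pool // kv_per_seq)}")
```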

Lamina throughput compared to vLLM (source: arxiv)

“Experimental results on 13B and 33B models show that our system can achieve up to 1.48X–12.1X higher throughput per cost than existing solutions,” the researchers write.

As LLMs become commoditized, companies that serve models will need new ways to reduce inference costs and capital expenditure on accelerators, which is exactly what attention offloading targets. The researchers have not yet released the code for Lamina, but the concept is clearly laid out and, as with similar papers, is likely to be implemented quickly by the open source community.