
DeepMind researchers discover impressive learning capabilities in long-context LLMs

Created: VentureBeat using Midjourney



In just a few years, large language models (LLMs) have gone from handling a few hundred words of input to several books’ worth of content at once. This expanded input capacity, referred to as the “context window,” is enabling applications and use cases that were previously impossible without extensive engineering effort.

A new study by researchers at Google DeepMind explores the “many-shot” in-context learning (ICL) ability of LLMs that have very long context windows. Their findings show that by fitting hundreds or even thousands of training examples in the prompt, you can improve the model’s abilities in ways that would previously require fine-tuning.

Many-shot ICL can become an important tool for enterprises that want to quickly create and iterate on prototypes of LLM applications before optimizing them for scale.

Few-shot vs many-shot ICL

ICL enables LLMs to learn new tasks from examples provided at inference time. The LLM is given a prompt that contains several solved examples of the desired task along with the problem it must solve. In-context learning is sometimes referred to as “few-shot learning.” 

Unlike task-specific fine-tuning, ICL does not require changing the model’s parameters, which makes it easier to use and accessible to more users. However, ICL is constrained by the model’s context window. For example, GPT-3 had a context window of around 2,000 tokens, which limited the number of examples that could fit into the prompt. 

But today’s models support context windows above 100,000 tokens, and more than a million in the case of Gemini 1.5 Pro, so you can fit hundreds or thousands of ICL examples into a single prompt. 
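To make the mechanics concrete, here is a minimal sketch of how a many-shot prompt can be assembled; the Problem/Answer formatting and the toy arithmetic examples are illustrative assumptions, not the paper’s exact setup:

```python
# A minimal sketch of many-shot prompt assembly. The separator format
# and the toy examples are assumptions chosen for illustration.

def build_many_shot_prompt(examples, query, instruction=""):
    """Concatenate solved examples ahead of the unsolved query."""
    parts = [instruction] if instruction else []
    for problem, solution in examples:
        parts.append(f"Problem: {problem}\nAnswer: {solution}")
    parts.append(f"Problem: {query}\nAnswer:")
    return "\n\n".join(parts)

# With a long-context model, `examples` can hold hundreds or thousands
# of pairs instead of the handful that few-shot prompting allows.
demos = [("2 + 2", "4"), ("3 * 5", "15")]  # scaled up in practice
print(build_many_shot_prompt(demos, "7 - 4", "Solve each problem."))
```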

In their study, the DeepMind researchers investigated how many-shot ICL affects the performance of LLMs in downstream tasks. They experimented with several problem domains, including math problem–solving, question-answering, outcome reward–modeling, translation of low-resource languages, planning and sentiment analysis.

In some cases, they included up to 8,192 ICL examples in one prompt. Their findings show that the model’s performance continues to improve as more examples are added to the prompt. In translation tasks, many-shot ICL on Gemini Pro set a new state-of-the-art performance on Kurdish and Tamil, two low-resource languages. In summarization tasks, many-shot ICL brought Gemini Pro on par with fine-tuned summarization models. Across tasks, performance typically peaked only when the in-context examples added up to hundreds of thousands of tokens.
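The experimental setup can be pictured as a simple sweep over shot counts. In this sketch, query_model is a hypothetical stand-in for a real LLM API call, and the shot counts echo the regimes the paper reports:

```python
# Sketch of a many-shot scaling sweep: fixed test set, growing number
# of in-context examples. query_model is a stand-in for a real API call.
import random

def accuracy_at(shots, train_pool, test_set, query_model):
    demos = random.sample(train_pool, shots)
    prefix = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in demos)
    hits = 0
    for question, answer in test_set:
        reply = query_model(f"{prefix}\n\nQ: {question}\nA:")
        hits += reply.strip() == answer
    return hits / len(test_set)

# Toy stand-ins so the sketch runs; swap in real data and a real model.
pool = [(f"item {i}", str(i % 2)) for i in range(10_000)]
test = pool[:50]
dummy_model = lambda prompt: "0"  # placeholder, always answers "0"

for shots in (4, 32, 256, 2048, 8192):  # few-shot -> many-shot regime
    print(shots, accuracy_at(shots, pool, test, dummy_model))
```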

Reinforced and unsupervised ICL

The main limitation of many-shot ICL is the need for a large volume of high-quality, human-generated examples, a problem that is especially acute in reasoning tasks. The researchers propose two techniques to reduce the dependence of many-shot learning on human-generated data.

The first technique, “reinforced ICL,” replaces human-crafted examples with model-generated rationales. The LLM is given a training problem and a few-shot or zero-shot chain-of-thought prompt to sample multiple rationales. Then, assuming that there is a mechanism to verify the final answer, the responses with the correct answer are selected to create an ICL dataset of problem/rationale pairs.
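A sketch of that loop, where sample_rationale() and extract_answer() are hypothetical helpers standing in for the model call and the answer checker:

```python
# Reinforced ICL sketch: sample chain-of-thought rationales, keep only
# those whose final answer verifies against the known solution, and
# reuse the survivors as in-context examples. The helper functions are
# hypothetical stand-ins, not the paper's exact implementation.

def build_reinforced_icl_set(problems, sample_rationale, extract_answer,
                             num_samples=8):
    dataset = []
    for problem, gold_answer in problems:
        for _ in range(num_samples):
            rationale = sample_rationale(problem)  # zero/few-shot CoT sample
            if extract_answer(rationale) == gold_answer:  # verification step
                dataset.append((problem, rationale))
                break  # keep one verified rationale per problem
    return dataset
```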

The second technique, “unsupervised ICL,” leverages the LLM’s internal knowledge of the problem. In unsupervised ICL, the prompt is composed of a list of unsolved problems along with a zero-shot or few-shot prompt for the target problem. This obviates the need for human-crafted answers. The researchers hypothesize that when the LLM already possesses the required knowledge to solve a task, adding relevant information to the prompt can help the model better focus on the internal concepts that can solve the problem.
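In code, the unsupervised variant reduces to listing problems without answers; the preamble wording below is an assumption:

```python
# Unsupervised ICL sketch: the prompt contains only unsolved problems,
# followed by the target problem and a zero-shot instruction.

def build_unsupervised_icl_prompt(unsolved_problems, target_problem):
    listing = "\n\n".join(f"Problem: {p}" for p in unsolved_problems)
    return (
        "You will see a list of problems from one domain, then be asked "
        "to solve a final one.\n\n"
        f"{listing}\n\n"
        f"Problem: {target_problem}\n"
        "Think step by step, then state the final answer."
    )
```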

“We find that either using model-generated rationales or only problems can reduce the dependence of many-shot learning on human-generated data,” the researchers write.

Changing model behavior

The researchers also found that many-shot ICL can overcome pre-training biases and learn non-natural language prediction tasks where few-shot ICL struggles. 

For example, the researchers flipped the labels of a sentiment analysis dataset so that they conflicted with the sentiment biases the LLM might have learned during training. Their experiments show that as more ICL examples are added to the prompt, performance on flipped and abstract labels improves dramatically, approaching that of default labels.
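A sketch of that probe, with a toy dataset and formatting chosen for illustration:

```python
# Flipped-label probe sketch: invert the sentiment labels in the
# demonstrations so they contradict pre-training priors, then check
# whether the model follows the in-context mapping instead.

FLIP = {"positive": "negative", "negative": "positive"}

def flipped_prompt(demos, query):
    shots = "\n".join(f"Review: {text}\nLabel: {FLIP[label]}"
                      for text, label in demos)
    return f"{shots}\nReview: {query}\nLabel:"

demos = [("Loved every minute.", "positive"),
         ("A tedious mess.", "negative")]  # scaled to many shots in practice
print(flipped_prompt(demos, "Surprisingly good."))
```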

They were also able to use many-shot ICL to repurpose the model for linear classification and sequential parity, tasks that are hard to solve without specialized training.
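For sequential parity, generating many-shot examples is mechanical, which is part of what makes it a clean test; the Input/Output formatting below is an assumption:

```python
# Sequential parity: map a bit sequence to the running parity of each
# prefix. Sketch of generating many-shot examples for this task.
import random

def sequential_parity(bits):
    out, parity = [], 0
    for b in bits:
        parity ^= b
        out.append(parity)
    return out

def make_example(length=20):
    bits = [random.randint(0, 1) for _ in range(length)]
    inp = " ".join(map(str, bits))
    outp = " ".join(map(str, sequential_parity(bits)))
    return f"Input: {inp}\nOutput: {outp}"

many_shot_block = "\n\n".join(make_example() for _ in range(512))
```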

“This suggests the potential of many-shot learning to adapt to new tasks and domains that might be misaligned with an LLM’s training data,” the researchers write.

What does this mean for enterprises?

As researchers and AI labs continue to extend the context window of LLMs, some experts suggest that there is no longer a need for fine-tuned models or other techniques such as retrieval-augmented generation (RAG). Instead of fine-tuning your models or creating complicated retrieval pipelines, you can just create a prompt with the needed information, training examples and instructions for the downstream task.

However, techniques such as many-shot ICL are not yet scalable. If you have an LLM application that receives tens of millions of requests every day, lengthening every prompt with a few hundred examples will have a significant impact on the speed and cost of inference. 
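Back-of-envelope arithmetic shows why. The per-token price and token counts below are placeholder assumptions, not real vendor pricing:

```python
# Rough cost of prepending many-shot examples to every request.
PRICE_PER_INPUT_TOKEN = 5e-6   # assumed: $5 per million input tokens
TOKENS_PER_EXAMPLE = 150       # assumed average example length
EXAMPLES = 500
REQUESTS_PER_DAY = 10_000_000

extra_tokens = TOKENS_PER_EXAMPLE * EXAMPLES   # 75,000 tokens per request
daily_cost = extra_tokens * PRICE_PER_INPUT_TOKEN * REQUESTS_PER_DAY
print(f"Extra input cost per day: ${daily_cost:,.0f}")  # ~$3,750,000
```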

Many-shot ICL can become an important tool for the exploration and prototyping stage of LLM applications. With it, developers will be able to try out different prompt engineering techniques without worrying about filling the context window. However, once they achieve the desired results, scaling the product will depend on using all the relevant techniques for reducing token consumption and using models that are smaller, faster, and cheaper.