
How Gradient created an open LLM with a million-token context window

Infinite tokens
Credit: VentureBeat with DALL-E 3



In a recent collaboration, AI startup Gradient and cloud compute platform Crusoe extended the “context window” of Llama-3 models to 1 million tokens. The context window determines the number of input and output tokens a large language model (LLM) can process. 

Big tech companies and frontier AI labs are locked in a race to extend the context windows of their LLMs. In less than a year, models have gone from supporting a few thousand tokens to more than a million. However, LLMs with very long context windows are mostly limited to private models such as Anthropic Claude (200k tokens), OpenAI GPT-4 (128k tokens), and Google Gemini (1 million tokens).

The race to create open-source models with long context windows can reshuffle the LLM market and unlock applications that are not possible with private models.

The need for open-source long-context LLMs

Gradient works with enterprise customers who want to integrate LLMs into their workflows. Even before Llama-3 came out, the company was facing context-length pain points in projects it was working on for its customers.

For example, language models that help in programming tasks, often referred to as “coding copilots,” have become an important development tool in many companies. Standard coding copilots can generate small bits of code at a time, such as a function. Now, companies are looking to extend those capabilities to creating entire modules of code.

“In order to do that, the language model needs to be able to reference an entire code base or maybe multiple GitHub code repositories,” Leo Pekelis, Chief Scientist at Gradient AI, told VentureBeat. 

One way to do this would be to feed the codebase to the LLM piecemeal over multiple calls. But that process would be slow and complicated, and it would produce inaccurate results because the model never has access to the entire codebase at any given time.

“Being able to put entire code bases right into a language model context alleviates a lot of these problems because now the language model is able to do what it can do best, which is reason over everything and its working memory and provide an answer that is both more accurate and more efficient,” Pekelis said.

As many companies have restrictions on what kind of data they can send to third parties, they can’t use models such as Gemini or Claude. This prompted the Gradient team to create their own million-token open model.

Open research

The commercialization of large language models has reduced the incentives for AI labs to share their findings and research. So while tech companies continue to extend the context window of LLMs, they are less likely to release code, data, or details about the techniques they use to optimize and improve their models.

However, this has not prevented the open research community from sharing their findings and contributing to the overall improvement of models. Gradient relied on many papers and open research from universities and institutes across the world.

Their base models were the 8-billion- and 70-billion-parameter versions of Meta’s open model Llama 3, which has a default context window of 8,000 tokens. 

They used techniques developed by Berkeley AI Research (BAIR) on distributed attention, which helped them increase the context length without exploding the memory and compute costs. The initial code implementation came from an open source project from a research institute in Singapore. And the mathematical formulas that enabled the models to learn from long context windows came from an AI research lab in Shanghai.
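
The article doesn’t detail those formulas, but a common ingredient in long-context extensions of this kind is raising the base frequency (“theta”) of the model’s rotary position embeddings (RoPE) so that positions remain distinguishable far beyond the original 8,000-token training length. The sketch below is purely illustrative: the 32x multiplier and the `rope_frequencies` helper are assumptions for demonstration, not Gradient’s published recipe.

```python
import torch

def rope_frequencies(head_dim: int, base: float) -> torch.Tensor:
    """Per-dimension rotation frequencies used by rotary position embeddings (RoPE)."""
    return 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))

# Illustrative only: raising RoPE's base makes the positional rotations cycle more
# slowly, so positions far beyond the original training length stay distinguishable.
# The 32x multiplier is a made-up example, not Gradient's published setting.
default_freqs  = rope_frequencies(head_dim=128, base=500_000.0)        # Llama-3's reported default base
long_ctx_freqs = rope_frequencies(head_dim=128, base=500_000.0 * 32)   # hypothetical long-context base

print(default_freqs[:4], long_ctx_freqs[:4])
```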

They used evaluation benchmarks from Nvidia to keep track of the performance of their models in comparison to other long-context LLMs such as Gemini.

“A lot of it wouldn’t have been possible without the open research community,” Pekelis said. “Open research influences our work across the stack.”

Addressing the compute bottleneck

Access to compute resources is one of the main challenges of LLM research. Most AI labs rely on large clusters of GPUs to train and test their models. Gradient teamed up with Crusoe to research long-context LLMs. Crusoe is building a purpose-built AI cloud that can help its partners build and explore different models cost-efficiently.

“The timing of this collaboration was interesting because we were bringing online an [Nvidia] L40S cluster,” Ethan Petersen, Senior Developer Advocate at Crusoe, told VentureBeat. “Typically when people think about those chips, they think about them in terms of inference and we wanted to showcase that we’re able to do really large-scale training across these as well as inference.”

Big tech companies are competing over the purchase of high-end GPUs such as Nvidia’s A100, H100, and the upcoming B100. Each of these chips costs tens of thousands of dollars, and server clusters can easily amount to millions of dollars. 

Crusoe also provides high-end GPUs, including AMD’s MI300X and the full range of Nvidia chips, but it tries to find the best solution for each client. The Crusoe team worked closely with Gradient to customize the L40S cluster and help them considerably cut down the cost of training their models.

“The way that we work with partners like Gradient is just to understand where we can provide the most efficient compute across the different types based on what they’re doing. And in this case, L40S was the right answer,” Patrick McGregor, Chief Product Officer at Crusoe, told VentureBeat. “We can provide a huge amount of value in customizing or tailoring different types of compute offerings.”

“A lot of the innovation that helped us train these models in a reasonable amount of time and release them roughly a week after Llama-3 came out was exactly in doing some of this network optimization on the L40S cluster,” Pekelis said. “With other cloud compute providers, there’s not as much open communication and that has made a lot of these custom configurations considerably more difficult.”

Evaluating the models

One of the key benchmarks to evaluate long-context windows is the “needle in a haystack” test, where a very specific piece of information is inserted into different parts of a long sequence of text and the model is questioned about it. 

“Our models get near perfect needle-in-a-haystack performance up to around 2 million context length, and that kind of puts us in the realm of what I’ve seen only from Gemini 1.5 Pro,” Pekelis said.

However, “needle in a haystack” might not necessarily provide an accurate measure of the model’s full context performance. The researchers also considered more advanced measures such as multiple needles in a haystack or “adversarial needles,” where conflicting pieces of information are inserted into the context and the model is queried about one of them.
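
As a rough illustration of how such a test can be constructed (the actual prompts, filler corpora, and scoring used in Gradient’s and Nvidia’s evaluations differ), the hypothetical sketch below plants two conflicting “needles” in a long block of filler text and asks the model to resolve them:

```python
import random

def build_haystack(filler_paragraphs: list[str], needles: list[str], total_chars: int) -> str:
    """Assemble a long filler document with 'needle' facts inserted at random depths."""
    haystack = []
    while sum(len(p) for p in haystack) < total_chars:
        haystack.append(random.choice(filler_paragraphs))
    # Drop each needle at a random position in the filler text.
    for needle in needles:
        haystack.insert(random.randrange(len(haystack) + 1), needle)
    return "\n\n".join(haystack)

# "Adversarial" variant: two conflicting facts are inserted, and the model is asked
# to pick the right one. The facts below are hypothetical.
needles = [
    "The passcode to the vault is 4812.",
    "Correction: the passcode to the vault is 9034.",
]
prompt = build_haystack(
    filler_paragraphs=["Lorem ipsum dolor sit amet..."] * 10,
    needles=needles,
    total_chars=50_000,
)
question = "According to the document, what is the current passcode to the vault?"
# prompt + question would then be sent to the long-context model and its answer scored.
```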

They also evaluated their model on RULER, a benchmark released by Nvidia that includes 13 different tasks for evaluating long-context language models with configurable sequence length and task complexity. 

They are also working on making the models more effective at many-shot in-context learning, where the model is configured for a new task on the fly by putting hundreds or even thousands of examples in the prompt.
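
A minimal sketch of what many-shot in-context learning looks like in practice, with hypothetical example data: instead of fine-tuning, labeled demonstrations are packed directly into the prompt, and a million-token window makes room for thousands of them.

```python
# Hypothetical labeled examples for a classification task.
examples = [
    {"text": "The shipment arrived two weeks late.", "label": "complaint"},
    {"text": "Thanks, the new dashboard is great!", "label": "praise"},
    # ... hundreds or thousands more examples fit in a million-token window
]

def build_many_shot_prompt(examples: list[dict], query: str) -> str:
    shots = "\n".join(f"Text: {ex['text']}\nLabel: {ex['label']}" for ex in examples)
    return f"{shots}\nText: {query}\nLabel:"

prompt = build_many_shot_prompt(examples, "My order was cancelled without notice.")
# With an 8K-token window only a handful of shots fit; at 1M tokens, the same
# pattern scales to thousands of demonstrations without any fine-tuning.
```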

Enterprise applications

Pekelis believes that long-context open models will make it easier for more companies and developers to build LLM-based applications.

“Right now, there is a bit of a distance in between individual uses and applications of AI and language models and enterprise applications, which are lagging behind a little bit,” Pekelis said. “Just allowing language models to do more and to be able to put more in the context windows, unlocks new applications.”

For example, with longer contexts, agentic systems, where one or more language models are put into multiple roles in a workflow, can do more with fewer calls because they can process much more information with each request. 

Long-context LLMs can also do things that would have otherwise required more complex data processing pipelines. One example is style transfer. Without long-context models, if you wanted a language model to mimic the writing style of a person, you would have to first gather data from different sources. Then you would have to preprocess and summarize the data and figure out a way to feed it into the model or possibly fine-tune the model.

“Here, what we found is that, for example, you can just take all of my past emails and give it to the language model, and it learns how to write like me,” Pekelis said.
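
A hedged sketch of that workflow, assuming a hypothetical directory of email text files and a generic chat-style message format: the writing samples are simply placed in the context rather than being summarized or used for fine-tuning.

```python
from pathlib import Path

def build_style_prompt(email_dir: str, draft_request: str) -> list[dict]:
    # Load past writing samples and place them directly in the context window.
    emails = [p.read_text() for p in sorted(Path(email_dir).glob("*.txt"))]
    corpus = "\n\n---\n\n".join(emails)  # with a 1M-token window, years of email can fit
    return [
        {"role": "system",
         "content": "Study the writing samples below and imitate their tone and style.\n\n" + corpus},
        {"role": "user", "content": draft_request},
    ]

messages = build_style_prompt("my_emails/", "Draft a short reply declining the meeting.")
# `messages` would then be passed to whichever chat-completion API serves the long-context model.
```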

LLMs with very long context windows could also reduce the need for retrieval-augmented generation (RAG), where for every prompt, the application must find relevant documents and insert them into the context.

An LLM with infinite context could, in theory, let you insert all your documents into the prompt and have the model pick the most relevant parts for each query. The catch is that all of that context would need to be resubmitted every time the user starts a new chat session, much as a RAG application must query its database for each request or new session.
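
The contrast can be sketched roughly as follows; the `retriever` and `llm` objects are hypothetical placeholders, not a specific library’s API.

```python
# Hypothetical interfaces: retriever.search() returns scored chunks, llm.complete() returns text.

def answer_with_rag(query: str, retriever, llm) -> str:
    top_chunks = retriever.search(query, k=5)           # fetch only the most relevant pieces
    context = "\n\n".join(chunk.text for chunk in top_chunks)
    return llm.complete(f"Context:\n{context}\n\nQuestion: {query}")

def answer_with_long_context(query: str, all_documents: list[str], llm) -> str:
    context = "\n\n".join(all_documents)                # the entire corpus goes into the prompt
    return llm.complete(f"Context:\n{context}\n\nQuestion: {query}")
```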

And of course, long context windows lower the barrier to creating prototypes and proofs of concept, and even help product teams understand what they can do with language models.

“A lot of times when we talk to customers, getting across what is possible is often a pretty big first step,” Pekelis said. “Having something that is able to get a prototype or first example up and running and showing the possibility of what it can do for an enterprise is really great.”