LLM Memory: A Seamless, Context-Aware, Policy-Compliant, Multi-Model Chat

The Vera Engineering Blog
Jan 23, 2024


[Screenshot] A chat log from Vera showing how your conversation retains context even when switching between models: the platform switches from GPT-4 to Cohere while remembering that I am Zaphod Beeblebrox, President of the Galaxy.

If it’s wrong to use three hyphenates in a blog post headline, well, we don’t want to be right.

Because in today’s release notes we have something special to tell you about…

LLM “Memory”

Imagine you’re in the Vera chat app (or calling our API to power your product). You and/or your users are chatting along with GPT-4 when a prompt comes in that’s best answered by AI21. Maybe it’s because that model is exceptional at translating into other languages; maybe it’s simply the least expensive one in your suite. Either way, model routing opens a session with a model that hasn’t seen the first half of this conversation. Or at least, that’s what it used to do…

Before: you’d lose all the prior context from the conversation.

But, today: Vera transfers the context over into this new session.
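
If you’re curious how that can work, here’s a rough sketch of the idea. The names below (ChatSession, Message, complete) are illustrative, not our actual SDK: the running conversation history lives in a provider-agnostic shape, so when the router picks a different model, that same history is simply handed to the new provider.

```python
# Illustrative sketch only -- not the actual Vera SDK.
# The conversation history is provider-agnostic, so switching models
# mid-conversation just means replaying the same history to the new one.
from dataclasses import dataclass, field

@dataclass
class Message:
    role: str      # "user" or "assistant"
    content: str

@dataclass
class ChatSession:
    history: list[Message] = field(default_factory=list)

    def send(self, user_text: str, model_client) -> str:
        """Record the user turn, call whichever model the router chose,
        and record the reply so the *next* model sees the full context."""
        self.history.append(Message("user", user_text))
        reply = model_client.complete(self.history)  # hypothetical client interface
        self.history.append(Message("assistant", reply))
        return reply

# session = ChatSession()
# session.send("I am Zaphod Beeblebrox, President of the Galaxy.", gpt4_client)
# session.send("Remind me: who am I?", cohere_client)  # still knows, thanks to the shared history
```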

From Day One we’ve been eager to provide a ChatGPT-like experience that takes advantage of the diverse and growing open source LLM catalogue, and this feature brings us one step closer to doing exactly that.

And of course, throughout your multi-model experience, we carry all of your PII, security, and content filtration policies along with the conversation, to ensure no model ever goes off the rails.
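
To picture how that holds up across providers (again, made-up names rather than our real API), think of the policies as a wrapper around whichever model the router selects, so a provider switch never skips a filter:

```python
# Illustrative sketch, not the actual Vera API: filters run on every call,
# regardless of which underlying model the router has picked.
import re
from typing import Callable, List

def redact_emails(text: str) -> str:
    """Toy PII filter: mask anything that looks like an email address."""
    return re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[REDACTED EMAIL]", text)

class PolicyEnforcedClient:
    """Wraps any model client so input and output filters apply on every turn."""

    def __init__(self, inner, input_filters: List[Callable[[str], str]],
                 output_filters: List[Callable[[str], str]]):
        self.inner = inner
        self.input_filters = input_filters
        self.output_filters = output_filters

    def complete(self, prompt: str) -> str:
        for f in self.input_filters:
            prompt = f(prompt)          # e.g. strip PII before it reaches the model
        reply = self.inner.complete(prompt)
        for f in self.output_filters:
            reply = f(reply)            # e.g. content filtration on the way back
        return reply
```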

Why does this matter?

Many of the most exciting applications of Generative AI live in its conversational interface; it’s a fantastic rubber duck, summarizer, translator, and much more. But when you tell your personal LLM that your name is Zaphod Beeblebrox, you probably don’t want to repeat yourself just to get the best-fit model for the job.

We’re taking model routing to the next level by ensuring that, no matter which model you start the conversation with, you get the response from the best, least expensive, most performant model for the job. And now you get those benefits through a seamless user-facing experience that won’t miss a beat.

Elsewhere in this release…

We made improvements to our model hosting and infrastructure to get even faster than we already were.

We fixed a bunch of bugs.

We made search logs even more detailed and searchable.

We put some finishing touches on a feature everyone’s been asking us for, that we’ll tell you all about in a couple weeks. In the meantime, say hi!
