
What’s a NIM? Nvidia Inference Microservices (NIM) is a new approach to gen AI model deployment that could change the industry

Credit: Nvidia



Nvidia is aiming to dramatically accelerate and optimize the deployment of generative AI large language models (LLMs) with a new approach to delivering models for rapid inference.

At Nvidia GTC today, the AI giant is announcing its Nvidia Inference Microservices (NIM) software technology, which packages optimized inference engines, industry-standard APIs and support for AI models into containers for easy deployment. While NIM provides prebuilt models, it also allows organizations to bring their own proprietary data and will support and help accelerate Retrieval-Augmented Generation (RAG) deployments.

The NIM technology marks a major milestone for gen AI deployment as the foundation of Nvidia’s next-generation strategy for inference, one that will have an impact on almost every model developer and data platform in the space. Nvidia has worked with large software vendors including SAP, Adobe, Cadence, CrowdStrike, Getty Images, ServiceNow and Shutterstock, as well as a long list of data platform vendors including Box, Cohesity, Cloudera, Databricks, Datastax, Dropbox, NetApp and Snowflake to support NIM.

NIM is part of the Nvidia AI Enterprise software suite, which is getting its 5.0 release today at GTC.

“We believe that Nvidia NIM is the best software package, the best runtime for developers to build on top of, so that they can focus on the enterprise applications,” Manuvir Das, VP of enterprise computing at Nvidia, explained during a press pre-briefing.

What exactly is Nvidia NIM?

At the most basic level, a NIM is a container full of microservices. 

The container can include any type of model, ranging from open to proprietary, that can run anywhere there is an Nvidia GPU — be that in the cloud, or even just on a laptop. In turn, that container can be deployed anywhere a container can run, which could be a Kubernetes deployment in the cloud, a Linux server or even a serverless Function-as-a-Service model. Nvidia will offer the serverless function approach on its new ai.nvidia.com website, where developers can go to begin working with NIM prior to deployment.
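Because a NIM exposes industry-standard APIs, interacting with a deployed container is just an HTTP call. As a rough illustration — the endpoint URL, port and model name below are assumptions, not Nvidia specifics — a chat request to a locally running inference container might be assembled like this:

```python
import json
from urllib import request

# Hypothetical endpoint for a locally deployed NIM container; the actual
# host, port and model name depend on the container you pull and run.
NIM_URL = "http://localhost:8000/v1/chat/completions"

def build_chat_request(model: str, prompt: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-style chat-completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

payload = build_chat_request("example-llm", "Summarize RAG in one sentence.")

# Sending the request (commented out so the sketch runs without a live server):
# req = request.Request(NIM_URL, data=json.dumps(payload).encode(),
#                       headers={"Content-Type": "application/json"})
# with request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

The same payload works regardless of where the container runs — a laptop, a Kubernetes cluster or a serverless endpoint — which is the portability point Nvidia is making.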

To be clear, a NIM isn’t a replacement for any prior approach to model delivery from Nvidia. It’s a container that includes a highly optimized model for Nvidia GPUs along with the necessary technologies to improve inference.

In response to a question from VentureBeat during the press briefing, Kari Briski, VP for gen AI software product management, emphasized that Nvidia is a platform company. She noted that all the ways that Nvidia has helped to support inference, with tools such as TensorRT, as well as the Triton Inference Server, are still important technologies.

“What we have found is that putting all these pieces together for a production environment to run gen AI at scale requires a lot of know-how and expertise, so that’s why we’ve packaged it together,” said Briski.

NIMs will help power responsive RAG capabilities for enterprises

A strong use case for NIMs will be in support of RAG deployment models.

“Pretty much every customer we talk to has already implemented dozens or hundreds of these RAGs,” said Das. “The question really is about how do we go to production? How do we take the prototyping that we’ve done, and now deliver real business value by going into production with the use of these models?”

Nvidia and several leading data vendors are hoping that NIMs are the answer to that question. Vector database capabilities are critical to enabling RAG, and several vector search technologies support NIMs, including Apache Lucene, Datastax, Faiss, Kinetica, Milvus, Redis and Weaviate.

The RAG approach will benefit from the integration of Nvidia NeMo Retriever microservices inside NIM deployments. NeMo Retriever, announced in November 2023, is Nvidia’s technology for optimized data retrieval in RAG pipelines.
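The retrieval-augmented flow these pieces enable can be sketched with a toy in-memory retriever. Everything here is an illustrative stand-in — real deployments would use a vector database and an embedding model rather than the word-overlap score below — not Nvidia’s implementation:

```python
# Toy RAG sketch: retrieve the most relevant documents for a query,
# then prepend them to the prompt sent to the LLM.

DOCUMENTS = [
    "NIM packages an optimized model and inference engine in a container.",
    "NeMo Retriever provides optimized data retrieval for RAG pipelines.",
    "Vector databases store embeddings for fast similarity search.",
]

def similarity(query: str, doc: str) -> float:
    """Crude word-overlap score standing in for embedding similarity."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    return sorted(docs, key=lambda d: similarity(query, d), reverse=True)[:k]

def build_prompt(query: str, docs: list[str]) -> str:
    """Augment the user query with retrieved context before inference."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs))
    return f"Context:\n{context}\n\nQuestion: {query}"

prompt = build_prompt("What does a vector database store?", DOCUMENTS)
```

The "go to production" gap Das describes is everything this sketch omits: scalable embedding generation, a production vector store, and an optimized retriever — which is where NeMo Retriever and the vector database integrations come in.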

“When you add a retriever that’s both accelerated and trained on some of the highest quality datasets, it matters,” said Briski.