NVIDIA NeMo

Build and Customize Your Own LLMs (with Tutorial)

What Is NVIDIA NeMo?

NVIDIA NeMo is an end-to-end platform designed for the development and customization of generative AI models. It facilitates the delivery of enterprise-ready models through precise data curation, advanced customization options, retrieval-augmented generation (RAG), and accelerated performance capabilities. This platform enables the training and deployment of generative AI models anywhere, including clouds, data centers, and edge environments, providing a full-stack solution that encompasses support, security, and API stability as part of NVIDIA AI Enterprise.

This is part of a series of articles about AI open source projects.

In this article:

  • Key Features of NVIDIA NeMo
  • NeMo Components
  • Quick Tutorial: Getting Started with NVIDIA NeMo
  • Optimizing Your AI Infrastructure with Run:ai

Key Features of NVIDIA NeMo

Let’s look at the important features of NeMo and how they facilitate AI projects.

State-of-the-Art Training Techniques

NeMo utilizes GPU-accelerated data curation tools like NeMo Curator, which prepare large-scale, high-quality datasets for pretraining generative AI models. This allows for training on thousands of compute cores, significantly reducing time and improving the accuracy of LLMs.

Advanced LLM Customization Tools

NeMo Customizer, a scalable microservice, enables the fine-tuning and alignment of large language models for domain-specific use cases. It leverages model parallelism to accelerate training and supports scaling to multiple GPUs and nodes for fine-tuning larger models.

Optimized AI Inference With NVIDIA Triton

NVIDIA NIM, which includes inference serving software such as NVIDIA Triton Inference Server, facilitates AI inference at scale. This enables accelerated generative AI inference, so applications can be deployed with confidence, whether on-premises or in the cloud.
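
To make the serving side more concrete, here is a minimal sketch of querying a model that has already been deployed behind Triton Inference Server, using its HTTP client (pip install tritonclient[http]). The model name "my_llm" and the tensor names "input_ids" and "logits" are placeholders for whatever your deployment actually exposes; they are not defined by NeMo or NIM.

import numpy as np
import tritonclient.http as httpclient

# Connect to a Triton server assumed to be running locally on the default HTTP port.
client = httpclient.InferenceServerClient(url="localhost:8000")

# Build the request: a single batch of token IDs for the (hypothetical) "my_llm" model.
token_ids = np.array([[101, 2023, 2003, 1037, 3231, 102]], dtype=np.int64)
infer_input = httpclient.InferInput("input_ids", list(token_ids.shape), "INT64")
infer_input.set_data_from_numpy(token_ids)

# Send the request and read back the named output tensor.
response = client.infer(model_name="my_llm", inputs=[infer_input])
logits = response.as_numpy("logits")
print(logits.shape)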

Easy-to-Use Recipes and Tools for Generative AI

NeMo offers a modular and reusable architecture that accelerates the development of conversational AI models. It supports end-to-end workflows, from data processing to deployment, and includes pre-trained models for ASR (automatic speech recognition), NLP, and TTS (text-to-speech), which can be fine-tuned or used directly.
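
For example, using one of the pre-trained checkpoints from the ASR collection takes only a few lines. This is a minimal sketch: "QuartzNet15x5Base-En" is one of the English models published in the ASR collection, and sample.wav stands in for your own 16 kHz mono audio file.

import nemo.collections.asr as nemo_asr

# Download (or load from the local cache) a pre-trained CTC speech recognition model.
asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name="QuartzNet15x5Base-En")

# Transcribe one or more audio files; the result is a list of transcripts.
transcripts = asr_model.transcribe(["sample.wav"])
print(transcripts)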

Related content: Read our guide to the NVIDIA Container Toolkit

Best-in-Class Pretrained Models

NeMo Collections provide a range of pre-trained models and training scripts, enabling quick application development or fine-tuning for specific tasks. As of the time of this writing, NeMo supports Llama 2, Stable Diffusion, and NVIDIA’s Nemotron-3 8B family of models.

Optimized Retrieval-Augmented Generation

NeMo Retriever offers high-performance, low-latency information retrieval, enhancing generative AI applications with enterprise-grade RAG capabilities. This enables real-time generation of business insights grounded in enterprise data.

NeMo Components

Here are the key components making up NVIDIA NeMo:

  • NeMo Core: Provides foundational elements like the Neural Module Factory, which supports both training and inference, offering a streamlined process for developing conversational AI models.
  • NeMo Collections: These are specialized modules and models for ASR, NLP, and TTS, which include both pre-trained models and training scripts for a variety of tasks, making the platform highly versatile (see the example after this list).
  • Neural Modules: Act as the building blocks for NeMo, defining trainable components like encoders and decoders. These modules can be interconnected to construct comprehensive models.
  • Application Scripts: NeMo simplifies the deployment of conversational AI models by offering ready-to-use scripts. These allow users to quickly train or fine-tune models on specific datasets, catering to a wide range of conversational AI applications.
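
As a small illustration of the NeMo Collections item above, every NeMo model class exposes list_available_models(), which reports the pre-trained checkpoints that from_pretrained() can download. The sketch below assumes the ASR and TTS collections are installed (e.g. via the ['all'] extra).

import nemo.collections.asr as nemo_asr
import nemo.collections.tts as nemo_tts

# List the pre-trained ASR checkpoints available for one model class.
for model_info in nemo_asr.models.EncDecCTCModel.list_available_models():
    print("ASR:", model_info.pretrained_model_name)

# The same works for TTS (and NLP) model classes.
for model_info in nemo_tts.models.FastPitchModel.list_available_models():
    print("TTS:", model_info.pretrained_model_name)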

Quick Tutorial: Getting Started with NVIDIA NeMo

The following tutorial shows how to install NeMo and use it to train a GPT-style model on the full text of English Wikipedia. The code is adapted from the official NeMo documentation.

Prerequisites: Install Conda and Python on your machine, and ensure you have a GPU with at least 16 GB of memory.

Step 1: Installing NeMo

NVIDIA recommends installing NeMo using Conda. Here is how to do it:

  1. Create a new Conda environment by typing: conda create --name nemo python==3.10.12
  2. Activate the new environment with: conda activate nemo
  3. Afterwards, install PyTorch. Get the installation command from the configurator tool on the PyTorch website; it will look something like this:
    conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
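  4. Finally, install the NeMo toolkit itself. At the time of writing, the documented route is pip; the NeMo documentation suggests installing Cython first, and the ['all'] extra pulls in every collection (ASR, NLP, and TTS):
    pip install Cython
    pip install nemo_toolkit['all']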

Step 2: Download and Preprocess Wikipedia Data

Download Wikipedia data (around 20GB) using the following command. Note that this will take several hours with a regular broadband connection:

wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

Next, extract raw data from the dump file using this command:

pip install wikiextractor
python -m wikiextractor.WikiExtractor enwiki-latest-pages-articles.xml.bz2 --json
find text -name 'wiki_*' -exec cat {} \; > train_data.jsonl
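
Before moving on, it is worth a quick sanity check that the extraction produced one JSON object per line with a text field. The file and field names below are exactly those created by the commands above; the other keys shown are the ones wikiextractor normally emits.

import json

# Read the first record of the extracted dataset and inspect it.
with open("train_data.jsonl", "r", encoding="utf-8") as f:
    first = json.loads(f.readline())

print(first.keys())          # typically: id, url, title, text
print(first["text"][:200])   # first 200 characters of the first article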

The file train_data.jsonl now contains the training data, under the text field. Rather than training a tokenizer from scratch, we'll download the vocabulary and merges files of the pre-built HuggingFace BPE tokenizer for GPT-2:

wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt

Finally, we’ll convert the training data into a memory-mapped format. This makes training more efficient, especially when using multiple GPUs.

python <NeMo_ROOT_FOLDER>/scripts/nlp_language_modeling/preprocess_data_for_megatron.py \
--input=train_data.jsonl \
--json-keys=text \
--tokenizer-library=megatron \
--vocab gpt2-vocab.json \
--dataset-impl mmap \
--tokenizer-type GPT2BPETokenizer \
--merge-file gpt2-merges.txt \
--output-prefix=hfbpe_gpt_training_data \
--append-eod \
--workers=32
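
The preprocessing script appends the JSON key ("text") to the output prefix, so with the options above you should end up with hfbpe_gpt_training_data_text_document.bin and .idx files, which is the dataset name the training command below refers to. A quick check:

from pathlib import Path

# Confirm the memory-mapped dataset files were written and are non-empty.
for suffix in (".bin", ".idx"):
    path = Path("hfbpe_gpt_training_data_text_document" + suffix)
    size = path.stat().st_size if path.exists() else 0
    print(path, "exists:", path.exists(), "size:", size)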

Step 3: Training GPT-Style Model

Let’s train the model. We are using a configuration that results in 124 million parameters, which can be trained on a single 16GB GPU using float16.

The following code trains the model:

python <NeMo_ROOT_FOLDER>/examples/nlp/language_modeling/megatron_gpt_pretraining.py \
    --config-path=<NeMo_ROOT_FOLDER>/examples/nlp/language_modeling/conf \
    --config-name=megatron_gpt_config \
    trainer.devices=1 \
    trainer.num_nodes=1 \
    trainer.max_epochs=null \
    trainer.max_steps=300000 \
    trainer.val_check_interval=300 \
    trainer.log_every_n_steps=50 \
    trainer.limit_val_batches=50 \
    trainer.limit_test_batches=50 \
    trainer.accumulate_grad_batches=1 \
    trainer.precision=16 \
    model.micro_batch_size=6 \
    model.global_batch_size=192 \
    model.tensor_model_parallel_size=1 \
    model.pipeline_model_parallel_size=1 \
    model.max_position_embeddings=1024 \
    model.encoder_seq_length=1024 \
    model.hidden_size=768 \
    model.ffn_hidden_size=3072 \
    model.num_layers=12 \
    model.num_attention_heads=12 \
    model.init_method_std=0.021 \
    model.hidden_dropout=0.1 \
    model.layernorm_epsilon=1e-5 \
    model.tokenizer.vocab_file=gpt2-vocab.json \
    model.tokenizer.merge_file=gpt2-merges.txt \
    model.data.data_prefix=[1.0,hfbpe_gpt_training_data_text_document] \
    model.data.num_workers=2 \
    model.data.seq_length=1024 \
    model.data.splits_string=\'980,10,10\' \
    model.optim.name=fused_adam \
    model.optim.lr=6e-4 \
    model.optim.betas=[0.9,0.95] \
    model.optim.weight_decay=0.1 \
    model.optim.sched.name=CosineAnnealing \
    model.optim.sched.warmup_steps=750 \
    model.optim.sched.constant_steps=80000 \
    model.optim.sched.min_lr=6e-5 \
    exp_manager.resume_if_exists=True \
    exp_manager.resume_ignore_no_checkpoint=True \
    exp_manager.create_checkpoint_callback=True \
    exp_manager.checkpoint_callback_params.monitor=val_loss \
    exp_manager.checkpoint_callback_params.save_top_k=3 \
    exp_manager.checkpoint_callback_params.mode=min \
    exp_manager.checkpoint_callback_params.always_save_nemo=False

To monitor the training process, you can use TensorBoard:

tensorboard --logdir nemo_experiments --bind_all

That’s it! You can watch as NeMo automatically trains your GPT-style model using the Wikipedia data.
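
Once a checkpoint you are happy with has been saved under nemo_experiments, you will usually want to generate text with it. Recent NeMo releases ship an inference script alongside the pretraining one, examples/nlp/language_modeling/megatron_gpt_eval.py; the invocation below is only a rough sketch, and the exact Hydra overrides (and the checkpoint path) should be verified against the NeMo version you installed.

python <NeMo_ROOT_FOLDER>/examples/nlp/language_modeling/megatron_gpt_eval.py \
    gpt_model_file=<path_to_your_model>.nemo \
    "prompts=['The history of machine learning']" \
    trainer.devices=1 \
    trainer.num_nodes=1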

Optimizing Your AI Infrastructure with Run:ai

Run:ai automates resource management and orchestration and reduces costs for the infrastructure used to train LLMs and other computationally intensive models. With Run:ai, you can automatically run as many compute-intensive experiments as needed.

Here are some of the capabilities you gain when using Run:ai:

  • Advanced visibility—create an efficient pipeline of resource sharing by pooling GPU compute resources.
  • No more bottlenecks—you can set up guaranteed quotas of GPU resources, to avoid bottlenecks and optimize billing.
  • A higher level of control—Run:ai enables you to dynamically change resource allocation, ensuring each job gets the resources it needs at any given time.

Run:ai simplifies machine learning infrastructure pipelines, helping data scientists accelerate their productivity and the quality of their models.

Learn more about the Run:ai GPU virtualization platform.