Small Yet Mighty: Improve Accuracy In Multimodal Search and Visual Document Retrieval with Llama Nemotron RAG Models

This post introduces small Llama Nemotron models designed for multimodal search and visual document retrieval. They deliver high accuracy at low latency and integrate seamlessly with standard vector databases and RAG pipelines.

How to build accurate, low-latency visual document retrieval with small Llama Nemotron models that work out-of-the-box with standard vector databases

This post walks through two small Llama Nemotron models for multimodal retrieval over visual documents:

- llama-nemotron-embed-vl-1b-v2, a dense multimodal embedding model
- llama-nemotron-rerank-vl-1b-v2, a multimodal cross-encoder reranking model

Both models are small (roughly 1.7B parameters), commercially licensed, and designed to work out-of-the-box with standard vector databases and existing RAG pipelines.

We will show how they behave on realistic document benchmarks below.

Why multimodal RAG needs world-class retrieval

Multimodal RAG pipelines combine a retriever with a vision-language model (VLM) so responses are grounded in both retrieved page text and visual content, not just raw text prompts.

Embeddings control which pages are retrieved and shown to the VLM. Reranking models decide which of those pages are most relevant and should influence the answer. If either step is inaccurate, the VLM is more likely to hallucinate—often with high confidence. Using multimodal embeddings together with a multimodal reranker keeps generation grounded in the correct page images and text.
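The two-stage flow above can be sketched end to end. The `embed`, `rerank`, and `generate` functions below are toy stand-ins (hypothetical placeholders, not the real model APIs) so the orchestration runs without a GPU:

```python
# Minimal sketch of the multimodal RAG flow described above:
# embed -> retrieve -> rerank -> generate, with toy stand-in functions.
def embed(text):                      # stand-in for a multimodal embedder
    return [text.count(c) for c in "aeiou"]

def similarity(a, b):
    return sum(x * y for x, y in zip(a, b))

def rerank(query, pages):             # stand-in for a cross-encoder reranker
    return sorted(pages, key=lambda p: similarity(embed(query), embed(p)),
                  reverse=True)

def generate(query, context):         # stand-in for the VLM
    return f"Answer to {query!r} grounded in {len(context)} page(s)"

corpus = ["revenue table page", "audit opinion page", "cover page"]
query = "quarterly revenue"
q = embed(query)
candidates = sorted(corpus, key=lambda p: similarity(q, embed(p)),
                    reverse=True)[:2]             # stage 1: dense retrieval
context = rerank(query, candidates)[:1]           # stage 2: rerank, keep best
print(generate(query, context))
```

In a real deployment, stage 1 is the embedding model plus a vector database, and stage 2 is the reranker scoring only the handful of retrieved candidates.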

The State-of-the-Art in Commercial Multimodal Search

The llama-nemotron-embed-vl-1b-v2 and llama-nemotron-rerank-vl-1b-v2 models
are designed for developers building multimodal question-answering and search over large corpora of PDFs and images.

The llama-nemotron-embed-vl-1b-v2 model is a single-vector (dense) embedding model that efficiently condenses visual and textual information into a single representation. This design ensures compatibility with all standard vector databases and enables millisecond-latency search at enterprise scale.
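Because each page becomes one dense vector, search reduces to a nearest-neighbor lookup. The sketch below assumes a hypothetical `embed(pages)` call returning one 2048-dim vector per page; random vectors stand in here, and a brute-force numpy cosine search stands in for a vector database:

```python
import numpy as np

# Dense single-vector retrieval sketch. `embed` is a placeholder for the
# real embedding model (hypothetical API); random vectors stand in so the
# flow runs without the model.
rng = np.random.default_rng(0)
DIM = 2048

def embed(items):
    return rng.normal(size=(len(items), DIM)).astype(np.float32)

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

pages = ["page-1.png", "page-2.png", "page-3.png", "page-4.png"]
index = normalize(embed(pages))          # one dense vector per page

query_vec = normalize(embed(["What was Q3 revenue?"]))[0]
scores = index @ query_vec               # cosine similarity to every page
top_k = np.argsort(-scores)[:2]          # top-2 page candidates
print([pages[i] for i in top_k])
```

Swapping the numpy matrix for FAISS, Milvus, or any other standard vector index changes nothing about the embedding side of this flow.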

The llama-nemotron-rerank-vl-1b-v2 model is a cross-encoder reranking model that reorders the top retrieved candidates to improve relevance, boosting downstream answer quality without changing your storage or index format.
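The reranking step only reorders an already-retrieved candidate list. In the sketch below, a hypothetical `rerank_score(query, page)` wraps the cross-encoder; a toy lexical-overlap score stands in so the logic runs without the model:

```python
# Cross-encoder reranking sketch: score each (query, candidate) pair
# jointly, then reorder. The real model also sees the page image.
def rerank_score(query, page_text):
    # Toy stand-in for the cross-encoder's relevance score.
    q, p = set(query.lower().split()), set(page_text.lower().split())
    return len(q & p) / max(len(q), 1)

query = "total revenue in Q3 2024"
candidates = [                      # top-k pages from the embedding stage
    "appendix glossary of terms",
    "Q3 2024 results total revenue grew 12 percent",
    "board of directors biographies",
]
reranked = sorted(candidates, key=lambda p: rerank_score(query, p),
                  reverse=True)
print(reranked[0])  # -> "Q3 2024 results total revenue grew 12 percent"
```

Because only the candidate order changes, the vector index and stored embeddings stay exactly as they were.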

We evaluated llama-nemotron-embed-vl-1b-v2 and llama-nemotron-rerank-vl-1b-v2 on five visual document retrieval datasets: the popular ViDoRe V1 and V2 benchmarks; ViDoRe V3, a realistic enterprise visual document retrieval benchmark composed of 8 public datasets; and two internal visual document retrieval datasets, DigitalCorpora-10k and Earnings V2.

Visual Document Retrieval (page retrieval) benchmarks

The table below reports the average retrieval accuracy (Recall@5) across five datasets, focusing specifically on commercially viable dense retrieval models.

We can see that llama-nemotron-embed-vl-1b-v2 provides better retrieval accuracy (Recall@5) on the image and image+text modalities than its predecessor, llama-3.2-nemoretriever-1b-vlm-embed-v1, and better accuracy on the text modality than llama-nemotron-embed-1b-v2, our small text embedding model. Finally, our VLM reranker llama-nemotron-rerank-vl-1b-v2 improves retrieval accuracy by a further 7.2%, 6.9%, and 6% across the three modalities.

Note: Image+Text modality means that both the page image and its text (extracted using ingestion libraries like NV-Ingest) are fed as input to the embedding model for more accurate representation and retrieval.

Visual Document Retrieval benchmarks (page retrieval) – Avg Recall@5 on DigitalCorpora-10k, Earnings V2, ViDoRe V1, V2, V3
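Recall@5, the metric used in these tables, is the fraction of queries whose relevant page appears among the top 5 retrieved results. A minimal sketch with toy retrieval runs:

```python
# Recall@k over a set of queries: a query counts as a hit if its gold
# page id appears in the first k entries of its ranked result list.
def recall_at_k(retrieved, relevant, k=5):
    hits = sum(1 for ranked, gold in zip(retrieved, relevant)
               if gold in ranked[:k])
    return hits / len(relevant)

retrieved = [            # ranked page ids returned for each query
    [3, 7, 1, 9, 2, 5],
    [4, 8, 6, 0, 1, 3],
    [5, 2, 9, 7, 8, 6],
]
relevant = [1, 3, 9]     # gold page id per query
print(recall_at_k(retrieved, relevant))  # 2 of 3 queries hit in the top 5
```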

The table below compares the accuracy of llama-nemotron-rerank-vl-1b-v2 against two other publicly available multimodal reranker models: jina-reranker-m0 and MonoQwen2-VL-v0.1. Although jina-reranker-m0 performs well on image-only tasks, its public weights are restricted to non-commercial use (CC-BY-NC). In contrast, llama-nemotron-rerank-vl-1b-v2 offers superior performance across the text and combined image+text modalities, and its permissive commercial license makes it an ideal choice for enterprise deployments.

Architectural Highlights & Training Methodology

The llama-nemotron-embed-vl-1b-v2 embedding model is a transformer-based encoder model, with approximately 1.7B parameters. It is a fine-tuned version of the NVIDIA Eagle family of models, using the Llama 3.2 1B language model and SigLip2 400M vision encoder. Embedding models for retrieval are typically trained with a bi-encoder architecture that encodes query and document independently. The model applies mean pooling over the output token embeddings from the language model, so that it outputs a single embedding with 2048 dimensions. Contrastive learning is used to train the embedding model to increase similarity between queries and relevant documents while decreasing similarity to negative samples.
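The bi-encoder training setup described above can be sketched numerically: mean-pool token embeddings into a single 2048-dim vector, then apply a contrastive (InfoNCE-style) loss with in-batch negatives. Random tensors stand in for real model outputs, and the temperature value is an illustrative assumption:

```python
import numpy as np

# Bi-encoder contrastive training sketch with the shapes from the post.
rng = np.random.default_rng(1)
B, T, D = 4, 16, 2048                    # batch, tokens, embedding dim

def mean_pool(token_embs):
    return token_embs.mean(axis=1)       # (B, T, D) -> (B, D)

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

q = l2norm(mean_pool(rng.normal(size=(B, T, D))))  # query embeddings
d = l2norm(mean_pool(rng.normal(size=(B, T, D))))  # document embeddings

logits = (q @ d.T) / 0.05                # similarities, temperature 0.05
m = logits.max(axis=1, keepdims=True)    # numerically stable log-softmax
log_probs = logits - m - np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
loss = -np.diag(log_probs).mean()        # positives sit on the diagonal
print(q.shape, float(loss) > 0)
```

Minimizing this loss pulls each query toward its paired document and pushes it away from the other documents in the batch, which is the contrastive objective the post describes.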

The llama-nemotron-rerank-vl-1b-v2 is a cross-encoder model with approximately 1.7B parameters. It is also a fine-tuned version of an NVIDIA Eagle-family model. The final layer hidden states of the language model are aggregated using a mean pooling strategy, and a binary classification head is fine-tuned for the ranking task. The model was trained with CrossEntropy loss using publicly available and synthetically generated datasets.
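The reranker head described above can be sketched the same way: mean-pool the final-layer hidden states of the jointly encoded (query, page) input, apply a binary classification head, and train with cross-entropy on relevant / not-relevant labels. Random hidden states stand in for real model outputs:

```python
import numpy as np

# Cross-encoder ranking head sketch: pooled hidden state -> 2-way logits.
rng = np.random.default_rng(2)
B, T, H = 8, 32, 2048                 # batch, tokens, hidden size

hidden = rng.normal(size=(B, T, H))   # final-layer hidden states
pooled = hidden.mean(axis=1)          # (B, H) via mean pooling

W = rng.normal(size=(H, 2)) * 0.01    # binary classification head
logits = pooled @ W                   # (B, 2)

labels = rng.integers(0, 2, size=B)   # 1 = relevant, 0 = not relevant
shifted = logits - logits.max(axis=1, keepdims=True)
log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
loss = -log_probs[np.arange(B), labels].mean()   # cross-entropy

# At inference, the "relevant" logit serves as the reranking score.
scores = logits[:, 1]
print(scores.shape, float(loss) > 0)
```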

How Organizations are Using These Models

Here are three examples of how organizations are applying the new Nemotron embedding and reranking models; you can adapt these patterns in your own systems.

Cadence: design and EDA workflows
Cadence models logic design assets such as micro-architecture and specification documents, constraints, and verification collateral as connected multimodal documents. As a result, an engineer can ask, “I want to extend the interrupt controller to support a low power state, show me which spec sections need changes,” and instantly surface the most relevant requirements. The system can then suggest a few alternative specification-update strategies, compare their tradeoffs, and generate the corresponding spec edits for the option the user selects.

IBM: domain-heavy storage and infra docs
IBM Storage treats each page of long PDFs—product guides, configuration manuals, and architecture diagrams—as a multimodal document, embeds it, and uses the reranker to prioritize pages where domain-specific terms, acronyms, and product names appear in the correct context before sending them to downstream LLMs. This improves how AI systems interpret storage concepts and reason over complex infrastructure documentation.

ServiceNow: chat over large sets of PDFs
ServiceNow uses multimodal embeddings to index pages from organizational PDFs and then applies the reranker to select the most relevant pages for each user query in its “Chat with PDF” experiences. By keeping high-scoring pages in context across turns, their agents maintain more coherent conversations and help users navigate large document collections more effectively.

Get Started

You can try the models directly:

Plug the new models into your existing RAG stack, or combine them with other open models on Hugging Face to build multimodal agents that understand your PDFs, not just their extracted text.

Stay up to date on NVIDIA Nemotron by subscribing to NVIDIA news and following NVIDIA AI on LinkedIn, X, YouTube and the Nemotron channel on Discord.

