Model Overview
Description
llama-nv-embed-reasoning-3b is a 3.2B-parameter embedding model designed to produce high‑quality sentence and document representations for retrieval, semantic search, and similarity tasks, with a strong focus on reasoning‑heavy content.
Built on a Llama‑style encoder and trained with contrastive objectives on diverse text (including question–answer pairs, technical explanations, and multi‑step reasoning data), the model is optimized to:
- Capture deeper logical and semantic relationships beyond surface keyword overlap
- Align short queries with long, information‑dense documents
- Support retrieval for tasks involving explanations, step‑by‑step reasoning, and problem solving
The model outputs dense vector embeddings suitable for use with standard vector databases and retrieval pipelines. Its 3B size offers a balance between quality and inference efficiency, making it suitable for both experimentation and latency‑sensitive workloads.
This model is for non-commercial/research use only.
License/Terms of Use
The use of this model is governed by the Creative Commons Non-Commercial License. The model is built on meta-llama/Llama-3.2-3B, which is released under the Llama 3.2 Community License Agreement.
This project will download and install additional third-party open source software projects. Review the license terms of these open source projects before use.
Deployment Geography
Global
Use Case
llama-nv-embed-reasoning-3b is a compact, accurate embedding model intended for practitioners building applications that retrieve and organize information based on deep semantic and reasoning relationships rather than simple keyword overlap. It is particularly well suited to text-only RAG systems, where both queries and documents may contain multi-step arguments, explanations, or technical details, and where aligning concise questions with long, information-dense passages is critical. Potential applications include reasoning-aware RAG for assistants and chatbots, semantic search over technical or knowledge-base content, intelligent document and ticket retrieval, clustering and deduplication of related texts, and analytics pipelines that require robust similarity signals between questions, rationales, and answers.
Release Date
03/10/2026 via https://huggingface.co/nvidia/llama-nv-embed-reasoning-3b
Model Version(s)
llama-nv-embed-reasoning-3b version 1.
Model Architecture
- Architecture Type: Transformer
- Network Architecture: meta-llama/Llama-3.2-3B
llama-nv-embed-reasoning-3b is a transformer-based text embedding model built from meta-llama/Llama-3.2-3B, with approximately 3.2B parameters.
Input(s)
Input type(s): Text
Input format(s): A list of text strings.
Input parameter: One-Dimensional (1D)
Other properties related to input: The model was trained with a maximum token length of 512 tokens for both queries and passages, while evaluation used a maximum token length of 8192 tokens.
Output(s)
Output type(s): Floats
Output format(s): A list of float arrays.
Output parameter: One-Dimensional (1D)
Other properties related to output: The model outputs an embedding vector with up to 3072 dimensions for each input text string.
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
Getting Started
Installation
The model requires transformers version 4.51.0 and flash-attention:

```shell
pip install transformers==4.51.0
pip install flash-attn==2.6.3 --no-build-isolation
pip install accelerate==0.34.2
```
Example Usage
```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel


def average_pool(last_hidden_states, attention_mask):
    """Average pooling with attention mask."""
    last_hidden_states_masked = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    embedding = last_hidden_states_masked.sum(dim=1) / attention_mask.sum(dim=1)[..., None]
    embedding = F.normalize(embedding, dim=-1)
    return embedding


model_name = "nvidia/llama-nv-embed-reasoning-3b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
model = model.to("cuda:0")
model.eval()

query_prefix = "query:"
document_prefix = "passage:"

queries = [
    "how much protein should a female eat",
    "summit define",
]
documents = [
    "As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
    "Definition of summit for English Language Learners. : 1 the highest point of a mountain : the top of a mountain. : 2 the highest level. : 3 a meeting or series of meetings between the leaders of two or more governments.",
]

queries = [f"{query_prefix} {query}" for query in queries]
documents = [f"{document_prefix} {document}" for document in documents]

batch_queries = tokenizer(queries, padding=True, truncation=True, return_tensors='pt').to("cuda:0")
batch_documents = tokenizer(documents, padding=True, truncation=True, return_tensors='pt').to("cuda:0")

with torch.no_grad():
    outputs_queries = model(**batch_queries)
    outputs_documents = model(**batch_documents)

# Average pooling over token embeddings
embeddings_queries = average_pool(outputs_queries.last_hidden_state, batch_queries["attention_mask"])
print("Query embeddings:")
print(embeddings_queries)
print(embeddings_queries.shape)
# torch.Size([2, 3072])

embeddings_documents = average_pool(outputs_documents.last_hidden_state, batch_documents["attention_mask"])
print("\nDocument embeddings:")
print(embeddings_documents)
print(embeddings_documents.shape)
# torch.Size([2, 3072])

# Compute similarity scores
scores = embeddings_queries @ embeddings_documents.T
print("\nSimilarity scores:")
print(scores.tolist())
# [[0.6688634157180786, 0.23073062300682068], [0.24395054578781128, 0.5622682571411133]]
```
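Because the pooled embeddings are L2-normalized, the dot product above is cosine similarity, and the score matrix can be used directly for top-k retrieval. A minimal sketch in pure PyTorch, using an illustrative score matrix in place of live model output:

```python
import torch

# Illustrative query-by-document score matrix; in practice this is
# embeddings_queries @ embeddings_documents.T from the example above.
scores = torch.tensor([
    [0.6689, 0.2307],
    [0.2440, 0.5623],
])

# For each query, rank documents by similarity and keep the top-k.
k = 1
top_scores, top_indices = scores.topk(k, dim=1)
for q, (idx, s) in enumerate(zip(top_indices.tolist(), top_scores.tolist())):
    print(f"query {q}: best document {idx[0]} (score {s[0]:.4f})")
```

For larger corpora the same normalized embeddings can be indexed in any inner-product vector database instead of materializing the full score matrix.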
Software Integration
Runtime Engine(s): Not Applicable
Supported Hardware Microarchitecture Compatibility: NVIDIA Ampere - A100 40GB and A100 80GB ; NVIDIA Hopper - H100 80GB.
Supported Operating System(s): Linux
The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.
Training and Evaluation Datasets
Training Dataset
The model is trained on a large, reasoning‑focused text corpus constructed via an LLM‑driven data synthesis pipeline, followed by integration of several existing reasoning datasets including ReasonEmbed, ReasonAug, and ReasonRank. The pipeline builds high‑quality query–document pairs by generating queries from document neighborhoods and then annotating positives and hard negatives with LLMs and retrieval models.
Data Modality: Text
Training Data Size: Less than a Billion Tokens
Data Collection Method by dataset: Hybrid (automated, human, synthetic)
Labeling Method by dataset: Hybrid (automated, human, synthetic)
Properties: The details of synthetic data generation, labeling, and data selection from public datasets are as follows.
1. Query Generation
Starting from a large raw corpus (e.g., the BRIGHT corpus), we first apply a domain‑specific filter to obtain a subset of relevant documents. To prevent test‑set leakage, all documents that are positives for any evaluation set are explicitly removed before question generation.
Each remaining document is treated as an anchor. For every anchor document, we retrieve its top‑4 most similar documents from the corpus using a retrieval model (reason-embed-basic-qwen3-4b-0928). The anchor and its four nearest neighbors form a small document set. Conditioned on each document set, an LLM (Qwen3-235B-A22B) is prompted to generate natural language queries of at most 300 tokens. The prompts explicitly encourage reasoning‑intensive questions, such that answering them requires non‑trivial inference and effectively leverages the associated documents rather than simple keyword matching.
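The document-set construction in this step can be sketched as follows. This is an illustrative reimplementation, not the actual pipeline code: embeddings here are random stand-ins for the retrieval model's (reason-embed-basic-qwen3-4b-0928) output, and the final LLM prompt to Qwen3-235B-A22B is only described in a comment:

```python
import torch

def build_document_sets(doc_embeddings: torch.Tensor, k: int = 4):
    """For each anchor document, find its top-k nearest neighbors by
    cosine similarity and group anchor + neighbors into a document set."""
    emb = torch.nn.functional.normalize(doc_embeddings, dim=-1)
    sims = emb @ emb.T
    sims.fill_diagonal_(float("-inf"))  # exclude the anchor itself
    _, neighbors = sims.topk(k, dim=1)
    return [[anchor] + row.tolist() for anchor, row in enumerate(neighbors)]

# Toy corpus of 6 documents with random embeddings (illustrative only).
torch.manual_seed(0)
doc_embeddings = torch.randn(6, 8)
document_sets = build_document_sets(doc_embeddings, k=4)
print(document_sets[0])  # anchor 0 plus its 4 nearest neighbors
# Each set would then go into an LLM prompt asking for a
# reasoning-intensive query of at most 300 tokens.
```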
2. Positive Annotation and Hard Negative Mining
Given the generated queries, we construct candidate query–document pairs by pairing each query with documents from the corpus. An LLM (Qwen3-next-80b-a3b-instruct) is then used to identify which documents genuinely support or answer each query, producing a set of positive documents and their similarity distribution.
To further improve the training signal, we mine hard negatives using a retrieval model (reason-embed-basic-qwen3-4b-0928). For each query, we retrieve documents by embedding similarity and select as hard negatives those whose similarity scores are slightly lower than the positives' but still close enough to remain semantically similar. This criterion ensures that hard negatives are challenging (near the decision boundary) while minimizing false positives.
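The exact similarity thresholds are not specified above, so the selection rule can only be illustrated with an assumed margin. A minimal sketch of that rule, with a hypothetical `mine_hard_negatives` helper:

```python
def mine_hard_negatives(similarities, positive_ids, margin=0.15):
    """Select documents whose similarity to the query falls just below the
    lowest-scoring positive: lower than every positive, but within an
    assumed margin so they stay near the decision boundary.
    `similarities` maps doc_id -> similarity score for one query."""
    min_pos = min(similarities[d] for d in positive_ids)
    return sorted(
        d for d, s in similarities.items()
        if d not in positive_ids and min_pos - margin <= s < min_pos
    )

# Toy scores for one query (illustrative only).
sims = {"d1": 0.82, "d2": 0.74, "d3": 0.69, "d4": 0.55, "d5": 0.30}
hard_negs = mine_hard_negatives(sims, positive_ids={"d1", "d2"})
print(hard_negs)
# ['d3'] -- close to the positives but below them; d4 and d5 are too far.
```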
3. Additional Training Corpora
After generating the synthetic query–document data, we further strengthen the retriever by incorporating training data from ReasonEmbed, ReasonAug, and ReasonRank. Based on empirical performance, we select the following domains:
From ReasonEmbed: biology, earth_science, economics, psychology, robotics, sustainable_living, stackoverflow, pony, theoremqa_questions, theoremqa_theorems
From ReasonAug: math, theorem
From ReasonRank: biology, stackoverflow, math-qa, math-theorem, earth_science, robotics
These additional datasets enrich the training distribution with diverse, reasoning‑centric tasks across science, math, and technical domains, improving the model’s robustness and generalization.
Evaluation Dataset
Data Modality: Text
Evaluation Data Size: Approximately 1.38K queries
Data Collection Method by dataset: Hybrid (automated, human, synthetic)
Labeling Method by dataset: Hybrid (automated, human, synthetic)
Properties: The model is evaluated on the BRIGHT dataset. BRIGHT is a benchmark for reasoning-intensive information retrieval, where determining the relevance between a query and a document goes beyond lexical or semantic matching and requires deliberate, multi-step reasoning. It covers diverse and advanced domains such as economics, mathematics, programming, and natural sciences, with queries drawn from real human data and carefully curated sources. In BRIGHT, relevant documents often share underlying principles, theories, or algorithms with the query rather than surface-level similarity. As a result, state-of-the-art retrievers that perform well on traditional benchmarks like BEIR and MTEB show substantial performance drops on BRIGHT, highlighting its difficulty and its role in evaluating and advancing retrieval models with genuine reasoning capabilities. More details on BRIGHT can be found on its leaderboard.
Evaluation
Run Evaluation on BRIGHT Leaderboard
```shell
pip install mteb==2.8.1
python eval_bright.py --model_name nvidia/llama-nv-embed-reasoning-3b --benchmark "BRIGHT(v1.1)"
```
Evaluation Results
Performance (nDCG@10) on 12 BRIGHT leaderboard short documents datasets
| Model | Avg | Bio. | Earth. | Econ. | Psy. | Rob. | Stack. | Sus. | Leet. | Pony | AoPS | TheoQ. | TheoT. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| llama-nv-embed-reasoning-3b | 38.3 | 63.4 | 60.2 | 39.5 | 45.5 | 32.6 | 34.0 | 43.3 | 37.5 | 15.0 | 10.5 | 39.5 | 38.5 |
| ReasonEmbed-Qwen3-8B (Redapter) | 38.1 | 55.5 | 56.6 | 36.2 | 47.4 | 35.3 | 36.6 | 39.1 | 33.6 | 16.4 | 12.5 | 41.4 | 47.2 |
| ReasonEmbed-Qwen3-4B (Redapter) | 37.1 | 55.4 | 54.5 | 34.9 | 46.9 | 34.0 | 36.1 | 37.4 | 34.5 | 13.6 | 11.3 | 41.4 | 45.1 |
| ReasonEmbed-Llama-3.1-8B (Redapter) | 36.2 | 55.4 | 56.2 | 35.2 | 48.5 | 32.1 | 37.3 | 41.1 | 28.8 | 16.8 | 9.1 | 37.9 | 36.6 |
| DIVER-Retriever | 28.9 | 41.8 | 43.7 | 21.7 | 35.3 | 21.0 | 21.2 | 25.1 | 37.6 | 13.2 | 10.7 | 38.4 | 37.3 |
| Seed-1.5-Embedding | 27.2 | 34.8 | 46.9 | 23.4 | 31.6 | 19.1 | 25.4 | 21.0 | 43.2 | 4.9 | 12.2 | 33.3 | 30.5 |
| RaDeR-gte-Qwen2-7B | 25.5 | 34.6 | 38.9 | 22.1 | 33.0 | 14.8 | 22.5 | 23.7 | 37.3 | 5.0 | 10.2 | 28.4 | 35.1 |
| ReasonIR-8B | 24.4 | 26.2 | 31.4 | 23.3 | 30.0 | 18.0 | 23.9 | 20.5 | 35.0 | 10.5 | 14.7 | 31.9 | 27.2 |
| Qwen3-Embedding-8B | 22.8 | 21.0 | 33.0 | 18.4 | 26.1 | 15.7 | 19.4 | 17.3 | 33.8 | 1.2 | 9.4 | 39.2 | 39.3 |
| llama-nemotron-embed-3b-v2* | 22.3 | 31.1 | 36.7 | 22.3 | 28.4 | 18.0 | 18.4 | 20.3 | 32.1 | 6.8 | 12.1 | 25.1 | 16.5 |
| Qwen3-Embedding-4B | 21.8 | 17.8 | 34.7 | 16.9 | 23.3 | 12.5 | 16.2 | 16.8 | 35.7 | 1.4 | 9.8 | 35.5 | 41.5 |
| BM25 | 14.5 | 18.9 | 27.2 | 14.9 | 12.5 | 13.6 | 18.4 | 15.0 | 24.4 | 7.9 | 6.2 | 10.4 | 4.9 |
Notes:
- Results are quoted from the corresponding published papers or from the leaderboard.
- Results marked with * were obtained by us, using the experimental setup described above.
Acknowledgement
This work was done by Jie He (j.he@ed.ac.uk) during his internship with NVIDIA under the mentorship of Yauhen Babakhin and Ronay Ak.
Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.