Nemotron 3 Nano - A New Standard for Efficient, Open, and Intelligent Agentic Models

Published December 15, 2025

If 2025 was the year of AI agents, then 2026 is gearing up to be the year of, well, multi-agents. This leap to the next step requires generating far more tokens, which in turn calls for models that are lightweight yet accurate.

However, this transition also forces difficult tradeoffs. Smaller models are fast and cheap but often lack the reasoning depth, robustness, and long context capacity needed for advanced multi-agents. Larger models deliver strong accuracy, but are too slow and expensive when many agents are running in parallel. As agentic systems grow, inference costs spiral, context windows become a bottleneck, and reliability starts to degrade, making efficiency of utmost importance.

Striking the right balance is what led NVIDIA to produce the NVIDIA Nemotron 3 Nano 30B A3B, part of our Nemotron 3 family of models (Nano, Super, and Ultra).

Nano utilizes a hybrid Mamba-Transformer Mixture-of-Experts (MoE) architecture with a 1M-token context window, enabling developers to build high-throughput, reliable agents that are more accurate, more scalable, and capable of specialized sub-tasks in long-running, multi-step workflows.

Nemotron 3 Nano Highlights (TL;DR)

  • Hybrid Mamba-Transformer MoE architecture: Mamba‑2 for long-context, low-latency inference combined with transformer attention for high-accuracy, fine-grained reasoning
  • 31.6B total parameters, ~3.6B active per token: Designed for high throughput and low latency
  • Exceptional inference efficiency: Up to 4x faster than Nemotron Nano 2 and up to 3.3x faster than leading models in its size category
  • Best-in-class reasoning accuracy: Across reasoning, coding, tools, and multi-step agentic tasks
  • Reasoning controls: Reasoning ON/OFF modes plus a configurable thinking budget to cap “thinking” tokens and keep inference cost predictable
  • 1M-token context window: Ideal for long-horizon workflows, retrieval-augmented tasks, and persistent memory
  • Fully open: Open Weights, datasets, training recipes, and framework
  • A full open data stack: 3T new high-quality pre-training tokens, 13M cross-disciplinary post-training samples, 10+ RL environments with datasets covering more than 900k tasks in math, coding, reasoning, and tool-use, and ~11k agent-safety traces
  • Easy deployment: Seamless serving with vLLM and SGLang, and integration via OpenRouter, popular inference service providers, and build.nvidia.com endpoints
  • License: Released under the nvidia-open-model-license

A two-panel figure comparing Nemotron 3 Nano with Qwen3-30B and GPT-OSS-20B. The left panel displays accuracy scores, showing Nano scoring equal or higher across benchmarks. The right panel displays inference throughput bars, where Nano's bars are significantly taller, illustrating 3.3x the speed of Qwen3 and 2.2x that of GPT-OSS.

Figure 1: Nemotron 3 Nano matches or exceeds the accuracy of Qwen3-30B and GPT-OSS-20B while delivering dramatically higher throughput. In an 8K input / 16K output configuration on a single H200 GPU, Nano achieves 3.3x higher throughput than Qwen3-30B and 2.2x higher than GPT-OSS-20B.

What is Nemotron 3 Nano?

Nemotron 3 Nano (30B/A3B) is our latest small-but-powerful reasoning model, building on the success of Nemotron Nano 2's hybrid Mamba-2 + Transformer architecture, reasoning ON/OFF modes, and explicit thinking budgets—while introducing a major architectural upgrade: a sparse Mixture-of-Experts (MoE) design.

At a high level:

  • 31.6B total parameters
  • ~3.6B active parameters per token, thanks to the MoE routing
  • Hybrid layer stack with interleaved Mamba‑2 layers and grouped-query attention (GQA) Transformer layers
  • A learned multi-layer perceptron (MLP) router that activates 6 of 128 experts for each token, delivering both efficiency and reasoning accuracy

This combination enables Nemotron 3 Nano to behave like a much larger model in terms of reasoning quality—while maintaining the speed and cost profile expected of a lightweight architecture.
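To make the routing concrete, here is a minimal, self-contained sketch of top-k expert routing in the spirit of the description above (6 of 128 experts selected per token). The tiny dimensions, the linear router, and the softmax-over-selected-experts weighting are illustrative assumptions, not the actual Nemotron layer implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Toy sparse MoE layer: a learned router picks k of E experts per token."""
    def __init__(self, d_model=64, d_ff=256, num_experts=128, top_k=6):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)  # Nemotron uses an MLP router; a linear layer keeps the sketch short
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                                      # x: [tokens, d_model]
        logits = self.router(x)                                # [tokens, num_experts]
        weights, idx = torch.topk(logits, self.top_k, dim=-1)  # keep the 6 best experts per token
        weights = F.softmax(weights, dim=-1)                   # normalize over the selected experts (an assumption)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):                         # plain loops for clarity; real kernels batch by expert
            for e in idx[:, slot].unique().tolist():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot:slot + 1] * self.experts[e](x[mask])
        return out

tokens = torch.randn(4, 64)
print(TopKMoE()(tokens).shape)  # torch.Size([4, 64]); only ~6/128 of the expert FLOPs are used per token
```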

Diagram of the Nemotron-Nano-3-30B-A3B architecture showing four sequential blocks. Each block contains repeating Mamba-2 layers and MoE units, with attention layers interspersed in the first and third blocks. The blocks repeat x5, x3, x1, and x4 times respectively, illustrating the hybrid Mamba-Transformer design with MoE layers replacing FFNs.

Figure 2: Nemotron 3 Nano architecture. It uses a hybrid Mamba-Transformer backbone, similar to Nemotron Nano v2, but replaces standard feed-forward network (FFN) layers with sparse MoE layers to significantly boost efficiency and scalability.

Nemotron 3 Nano is built for agentic, reasoning, tool-use, and chat tasks and supports a context length of up to 1M tokens.

It extends the Nemotron model family we released earlier in the year, continuing the progression toward increasingly accurate and efficient open models for reasoning and agent development.

The roadmap graphic illustrates the evolution of Nemotron model families: Nemotron 1 enhances Llama models with stronger reasoning capabilities; Nemotron 2 introduces a hybrid Mamba-Transformer architecture, delivering state-of-the-art accuracy and efficiency; Nemotron 3 adds sparse MoE to the hybrid design, further improving accuracy, throughput, latency, and overall compute efficiency.

Figure 3: The NVIDIA Nemotron family of open models is engineered for advanced reasoning and agentic tasks, delivering leading accuracy and best-in-class efficiency.

How we built Nemotron 3 Nano

We employed a multi-stage pipeline combining massive-scale pre-training, specialized supervised fine-tuning (SFT), and advanced reinforcement learning techniques to refine the model's reasoning abilities and agentic behavior.

Pre-Training

Nemotron 3 Nano was trained on a 25-trillion-token corpus (including 2.5T new Common Crawl tokens) spanning web crawls, code and math, Wikipedia and academic text, and multilingual content (15 languages). Pre-training followed a two-phase strategy:

  • Phase 1: Diversity (first 94%)
    A broad, diverse mixture to maximize coverage and generalization.
  • Phase 2: Quality (final 6%)
    High-quality sources such as Wikipedia to refine accuracy and consistency.
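As a rough illustration of what such a phase switch can look like in a data pipeline, here is a tiny, hypothetical blend schedule. Only the 94% / 6% split comes from the text above; the source names and weights are invented for illustration.

```python
TOTAL_TOKENS = 25_000_000_000_000  # 25T-token corpus

# Hypothetical blend weights per phase; only the 94% / 6% split is from the post.
PHASES = [
    {"until_frac": 0.94, "blend": {"web_crawl": 0.55, "code_math": 0.25, "academic": 0.10, "multilingual": 0.10}},
    {"until_frac": 1.00, "blend": {"wikipedia": 0.40, "curated_high_quality": 0.60}},
]

def blend_for(tokens_seen: int) -> dict:
    """Return the sampling weights in effect after `tokens_seen` training tokens."""
    frac = tokens_seen / TOTAL_TOKENS
    for phase in PHASES:
        if frac <= phase["until_frac"]:
            return phase["blend"]
    return PHASES[-1]["blend"]

print(blend_for(10_000_000_000_000))   # phase 1 mixture (diversity)
print(blend_for(24_000_000_000_000))   # phase 2 mixture (quality)
```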

Long-Context Extension

Context length for Nemotron 3 Nano was extended by adding a continued pre-training (CPT) stage at 512k sequence length. A mixture of 512k and 4k sequence-length training preserved short-context benchmark scores while extending the context length. We included synthetic data designed to support long-range retrieval, multi-hop reasoning, multi-document information aggregation, and related capabilities across different stages of training.

We are releasing a large portion of these pretraining datasets openly on Hugging Face. These additions contribute 3 trillion new tokens to the Nemotron-Pretraining series, with higher-fidelity coverage of code, math, and reasoning. Enhanced synthetic augmentation and annotation pipelines increase data density and structure, improving training efficiency and directly contributing to Nemotron-3 Nano's strong quality profile.
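For example, the released data can be streamed directly with the Hugging Face datasets library. The repository id below is illustrative; check the Nemotron-Pretraining collection on Hugging Face for the exact dataset names in this release.

```python
from datasets import load_dataset

# Repository id is illustrative; see the Nemotron-Pretraining collection on
# Hugging Face for the exact dataset names in this release.
ds = load_dataset("nvidia/Nemotron-Pretraining-Sample", split="train", streaming=True)

for example in ds.take(3):  # stream a few records without downloading everything
    print(example)
```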

With Nemotron 3, we’ve learned that quantity without quality isn’t useful. Our pre-training data continues to shift toward efficient data: smarter filtration, rewritten and improved samples, and nearly half a trillion tokens of rescued math and code that previous pipelines would have discarded. This focus on signal over noise directly enables smarter, smaller models that are cheaper to train and run, without sacrificing accuracy.

Post-Training

Post-training included supervised fine-tuning (SFT) and two distinct stages of reinforcement learning: reinforcement learning from verifiable rewards (RLVR) and reinforcement learning from human feedback (RLHF). These stages specialize the model for agentic workflows, tool use, high-quality reasoning, and chat tasks.

Supervised Fine-Tuning

Our SFT recipe was improved from Nano v2 to better support complex agentic behaviors. Improvements included greater dataset diversity, higher data quality, and explicit training for multi-step and multi-turn reasoning.

The model learns both reasoning ON and OFF modes directly from the chat template (see the sketch after the list below):

  • Reasoning ON: multi-step mode, where the model preserves and builds upon its prior chain-of-thought within a task.
  • Reasoning OFF: multi-turn mode, where reasoning content is not carried over across turns, ensuring concise responses.
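For reference, here is a minimal sketch of toggling the two modes when formatting prompts with Hugging Face transformers. The model id and the enable_thinking flag are assumptions (some templates use a system-prompt switch instead); check the model card for the exact control the released chat template exposes.

```python
from transformers import AutoTokenizer

MODEL_ID = "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B"  # illustrative id; confirm on Hugging Face

tok = AutoTokenizer.from_pretrained(MODEL_ID)
messages = [{"role": "user", "content": "Plan a three-step refactor for this module."}]

# Reasoning ON: the template leaves room for a <think>...</think> span the model fills in.
prompt_on = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Reasoning OFF: `enable_thinking` is a hypothetical flag name borrowed from other
# open reasoning models; the real template may use a different switch.
prompt_off = tok.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)

print(prompt_on[:300])
print("---")
print(prompt_off[:300])
```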

The graph from Artificial Analysis plots small reasoning language models by intelligence index (y-axis) against output tokens per second (x-axis). Nemotron 3 Nano delivers the highest throughput efficiency using the hybrid MoE architecture and leading accuracy with advanced reinforcement learning using NeMo Gym.

Figure 4: Nemotron 3 Nano delivers the highest throughput efficiency using the hybrid MoE architecture and leading accuracy with advanced reinforcement learning using NeMo Gym.

We are releasing the majority of the SFT datasets and the codebase openly.

Our new post-training data release also expands the intelligence of the model by design. We added 13 million new post-training samples—nearly tripling our previous release and making this the largest openly available post-training corpus by 2.5×. To reach higher reasoning accuracy, we blended cross-disciplinary domains including code, math, physics, and chemistry to create novel, multi-step problems that don’t exist in scraped web data. This helps the model reason about questions that fall between fields, where real scientific and technical progress often happens.

Multi-Environment Reinforcement Learning from Verifiable Rewards (RLVR)

Nemotron 3 Nano was trained simultaneously across many distinct environments, spanning math, code, question answering, instruction following, multi-step tool use, multi-turn conversations, structured output, and more, using synchronous GRPO (Group Relative Policy Optimization). This multi-environment RLVR stage ensures uniform improvement across domains, reduces overfitting to any single benchmark, and yields more reliable agentic behavior in real-world workflows.
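As a reference point, the group-relative part of GRPO can be summarized in a few lines: each sampled completion is scored not in absolute terms but against the other completions drawn for the same prompt. The sketch below is a simplified illustration of that advantage computation, not the NeMo RL implementation.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: normalize each rollout's reward against the
    other rollouts sampled for the same prompt (one row per prompt group).

    rewards: [num_prompts, group_size] verifiable rewards, e.g. 1.0 if the
    unit tests or math checker passed, 0.0 otherwise.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Two prompts, four sampled completions each (toy verifiable rewards).
rewards = torch.tensor([[1.0, 0.0, 1.0, 0.0],
                        [1.0, 1.0, 1.0, 0.0]])
print(grpo_advantages(rewards))
```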

This figure shows reward curves from multiple environments over training steps, showcasing how the model learned many different capabilities simultaneously.

Figure 5: Uniform improvements due to training simultaneously on multiple RL environments.

Models need more than textbooks to train — they need a gym. NVIDIA is one of the few open-model providers releasing both reinforcement learning datasets and the environments used to train them. This enables developers to test agents, capture critical edge cases, and prevent model drift over time. In this release, we are adding 10+ new RL environments covering competitive coding, advanced math, and even real-world calendar scheduling.

We are also open-sourcing the essential RLVR infrastructure: the environments, their datasets, and the code used to build and scale them. These components form the foundation of the new NVIDIA NeMo Gym library, which enables scalable RL environment construction.

Training at scale is executed using NVIDIA NeMo RL, our high-performance RL library.

Reinforcement Learning from Human Feedback (RLHF)

To further refine the model’s conversational quality, we trained a generative reward model (GenRM) using GRPO on Qwen3-235B-A22B.

Given a conversation history, a new user query, and two candidate assistant responses, the GenRM explicitly reasons about the strengths and weaknesses of each response, produces individual helpfulness scores and generates a relative ranking between the candidates. These reward signals are then used in an RLHF stage to improve helpfulness, coherence, correctness, and overall chat experience in Nemotron 3 Nano.
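To make that flow concrete, here is a hypothetical sketch of how a pairwise GenRM call might be structured on the client side. The prompt wording, score scale, and output format are illustrative assumptions, not the actual Nemotron GenRM interface.

```python
import re

def build_genrm_prompt(history: str, query: str, resp_a: str, resp_b: str) -> str:
    """Assemble a pairwise comparison prompt (format is illustrative only)."""
    return (
        "You are a reward model. Compare the two candidate responses.\n"
        f"Conversation so far:\n{history}\n\nUser query:\n{query}\n\n"
        f"Response A:\n{resp_a}\n\nResponse B:\n{resp_b}\n\n"
        "Reason about the strengths and weaknesses of each, then output lines:\n"
        "SCORE_A: <1-10>\nSCORE_B: <1-10>\nPREFERRED: <A|B>"
    )

def parse_genrm_output(text: str) -> dict:
    """Extract per-response scores and the relative ranking from the GenRM text."""
    scores = dict(re.findall(r"SCORE_([AB]):\s*(\d+)", text))
    preferred = re.search(r"PREFERRED:\s*([AB])", text)
    return {
        "score_a": int(scores.get("A", 0)),
        "score_b": int(scores.get("B", 0)),
        "preferred": preferred.group(1) if preferred else None,
    }

print(parse_genrm_output("...reasoning...\nSCORE_A: 7\nSCORE_B: 9\nPREFERRED: B"))
```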

The combined post-training pipeline—SFT+RLVR+RLHF—produces the final Nemotron 3 Nano 30B-A3B model.

As models evolve into multi-step agents that use tools, they face entirely new safety and security challenges. To support responsible deployment, we are releasing an agentic safety dataset featuring nearly 11,000 labeled traces from realistic, tool-using workflows. This gives developers the data they need to evaluate, diagnose, and mitigate safety risks before agentic systems reach production.

Why We Needed Better RL Infrastructure

During development, the limitations of existing RL tooling became clear. Training large reasoning models with RL is difficult because:

  • Multi-step rollouts are complex to orchestrate
  • Tool integrations are often brittle
  • Orchestration logic can conflict with training loop design
  • Collecting rollout data at scale is slow and difficult
  • Most high-quality RL environments are closed and proprietary

As a result, meaningful RL training has historically been accessible only to major AI labs.

NeMo Gym: Opening RL to Everyone

To overcome these challenges, NVIDIA built NeMo Gym, an open-source, standardized library for building and scaling RL environments.
NeMo Gym powers the reinforcement learning pipelines used in Nemotron 3 Nano, and now gives developers:

  • Ready-to-use RL environments across math, code, tool use, multi-turn reasoning, and agentic workflows
  • The ability to build custom RL environments with verifiable reward logic
  • Ecosystem interoperability with NeMo RL and other training frameworks (TRL, Unsloth, and VeRL support underway)
  • High-throughput rollout orchestration, enabling large-scale RL training
  • A practical pathway to perform RL on their own models

NeMo Gym is a flexible open source library for building and running RL training environments. It is part of the broader NVIDIA NeMo software suite for end-to-end model training and provides infrastructure for designing, running, and scaling complex RL environments.

Battle-tested through the development of the entire Nemotron 3 model family, NeMo Gym includes the core environment-development infrastructure, a growing collection of ready-to-use training environments alongside the datasets used in RLVR, and integration with NeMo RL, the high-performance, efficient RL training engine with support for advanced RL training algorithms, end-to-end FP8 training, and async RL.

With NeMo Gym, teams can quickly assemble environments using modular server components and templates, integrate external tools, systems, or databases, and orchestrate long-context, multi-step, multi-turn rollouts. This allows training environments to be iterated on and shared independent of the training loop.

Diagram illustrating the interaction between an RL training framework on the left and NeMo Gym on the right. The training framework sends task prompts to the agent server in NeMo Gym. The agent server coordinates with the policy model server and external resources server to collect rollouts and verify task performance. The scored trajectories are returned to the training framework for model updates.

Figure 6: How NeMo Gym fits into the RL training loop: The RL training framework (e.g., NeMo RL) sends task prompts to NeMo Gym, which operates as a set of independent HTTP services. Inside NeMo Gym, the agent server orchestrates rollouts by coordinating the policy model server (generation) and external resources server (tools and rewards). NeMo Gym returns model trajectories and rewards to the training framework, which then updates and refits the policy model.

By decoupling RL environments from RL training frameworks, NeMo Gym works seamlessly with many popular training frameworks (such as NeMo RL), supports high-throughput, concurrent rollout collection, and enables large-scale distributed RL training. This separation of concerns makes it easy to scale RL workflows and adapt environments as training objectives evolve.
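For intuition about this separation of concerns, here is a deliberately tiny, hypothetical environment that exposes the two things a trainer needs: a task sampler and a verifiable-reward check. Class and method names are illustrative only and do not reflect NeMo Gym's actual API; in the real library these pieces run as HTTP services behind the agent server.

```python
from dataclasses import dataclass
import random

@dataclass
class Task:
    prompt: str
    answer: str  # hidden from the policy; used only by the verifier

class ToyMathEnv:
    """Hypothetical environment: sample tasks, verify completions, return rewards."""

    def sample_task(self) -> Task:
        a, b = random.randint(2, 99), random.randint(2, 99)
        return Task(prompt=f"What is {a} * {b}? Answer with a number only.", answer=str(a * b))

    def verify(self, task: Task, completion: str) -> float:
        """Verifiable reward: 1.0 if the final token matches the reference answer."""
        return 1.0 if completion.strip().split()[-1] == task.answer else 0.0

env = ToyMathEnv()
task = env.sample_task()
print(task.prompt, "-> reward:", env.verify(task, "The product is " + task.answer))
```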

To accelerate experimentation, NeMo Gym ships with an expanding RL Hub—a catalog of ready-to-use, domain-specific environments that developers can use immediately or extend. Current domains include math, coding, instruction following, multi-step tool use, and multi-turn structured conversations. Practitioners can fine-tune models on these environments out of the box, reuse community contributions, or publish their own.

Start Building with Nemotron 3 Nano

Nemotron 3 Nano (30B A3B) delivers state-of-the-art accuracy in an exceptionally cost-efficient package. It offers up to 3.3x higher throughput than leading open-source models of similar size (see Figure 1), while supporting a 1M-token context window and performing well on long-context reasoning benchmarks.

Built for high-volume, real-time execution, Nemotron 3 Nano excels in math and coding, multi-step tool calling, and multi-turn agentic workflows. It also retains the classic Nemotron Thinking ON/OFF modes and Thinking Budget controls, giving developers the ability to tune exactly how much the model thinks for each task.
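One way to see how a thinking budget keeps inference cost predictable is to enforce it on the client side: let the model think up to a token cap, then close the thinking span and ask for the final answer. The sketch below assumes a local OpenAI-compatible server, an illustrative model id, and a <think>...</think> convention; the released chat template and serving cookbooks describe the supported mechanism.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # e.g. a local vLLM server
MODEL = "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B"   # illustrative model id
BUDGET = 512                                      # cap on "thinking" tokens

# Illustrative raw prompt; the real chat template defines the exact format.
prompt = ("User: A train leaves at 9:12 and arrives at 11:47. How long is the trip?\n"
          "Assistant: <think>\n")

# Phase 1: think until the budget is hit or the model closes the tag itself.
thought = client.completions.create(
    model=MODEL, prompt=prompt, max_tokens=BUDGET, stop=["</think>"]
).choices[0].text

# Phase 2: force-close the thinking span, then generate the visible answer.
final = client.completions.create(
    model=MODEL, prompt=prompt + thought + "\n</think>\n", max_tokens=256
).choices[0].text
print(final)
```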

With this release, we are also introducing NeMo Gym, which contains the ready-to-use training environments we developed during Nemotron 3 training, plus the infrastructure to build your own training environments and scale rollout collection.

We are releasing:

  • Full model weights
  • The complete training recipe, including SFT, RLVR, and RLHF
  • Most of the datasets (pre-training, post-training) used throughout the training pipeline
  • Training frameworks that power Nemotron 3

Everything you need to study, reproduce, or extend the model is available openly.

Get started with Nemotron 3 Nano:

  • Download the model: Now available on Hugging Face.
  • Try hosted endpoints: Run queries instantly on OpenRouter or build.nvidia.com.
  • Deploy at scale: Use our cookbooks for vLLM, TRT-LLM, and SGLang (a minimal serving sketch follows below).
  • Experiment, develop, and run at the edge: Available on edge devices such as NVIDIA RTX AI PCs and workstations and DGX Spark via Llama.cpp, LM Studio, and Unsloth.
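As a starting point, here is a minimal serving sketch using vLLM's OpenAI-compatible server. The model id and flags are assumptions; consult the official cookbooks for the recommended configuration.

```python
# Serve the checkpoint with vLLM's OpenAI-compatible server, then query it like
# any OpenAI endpoint. Shell command shown as a comment for reference:
#
#   vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B --trust-remote-code
#
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B",  # must match the served model id
    messages=[{"role": "user", "content": "Summarize the tradeoffs of MoE models in two sentences."}],
    max_tokens=300,
)
print(resp.choices[0].message.content)
```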

For a deep dive into the architecture, datasets, and benchmarks, read the full Nemotron 3 Nano Technical Report.
