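Nvidia Releases Nemotron Speech ASR with Support for Streaming and Batch Workloads

Hacker News · 4 months ago

Nvidia has released Nemotron-Speech-Streaming-En-0.6b, a unified AI model for English automatic speech recognition (ASR). The model is designed for both low-latency streaming and high-throughput batch applications, and uses a cache-aware architecture that improves efficiency and reduces latency.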


Nemotron Speech ASR


Nemotron-Speech-Streaming-En-0.6b is the first unified model in the Nemotron Speech family, engineered to deliver high-quality English transcription across both low-latency streaming and high-throughput batch workloads. The model natively supports punctuation and capitalization and offers runtime flexibility with configurable chunk sizes, including 80ms, 160ms, 560ms, and 1120ms.

Why Choose nvidia/nemotron-speech-streaming-en-0.6b?

This model consists of a cache-aware streaming 🦜 Parakeet (FastConformer) encoder with an RNN-T decoder. It is designed for real-time speech-to-text applications where low latency is critical, such as voice assistants, live captioning, and conversational AI systems. Unlike traditional "buffered" streaming, the cache-aware architecture enables continuous transcription by processing only new audio chunks while reusing cached encoder context. This significantly improves computational efficiency and minimizes end-to-end delay without sacrificing accuracy.

🗣️ Experience Nemotron-Speech-Streaming-En-0.6b in action here: https://huggingface.co/spaces/nvidia/nemotron-speech-streaming-en-0.6b

This model is ready for commercial/non-commercial use.

Read more about the model in the dev blog and check out the paper.

Explore more from NVIDIA:

For documentation, deployment guides, enterprise-ready APIs, and the latest open models—including Nemotron and other cutting-edge speech, translation, and generative AI—visit the NVIDIA Developer Portal at developer.nvidia.com.
Join the community to access tools, support, and resources to accelerate your development with NVIDIA's NeMo, Riva, NIM, and foundation models.

What is Nemotron?
NVIDIA Developer Nemotron
NVIDIA Riva Speech
NeMo Documentation

Access Model Inference and Examples:

Model Architecture

Architecture Type: FastConformer-CacheAware-RNNT

The model is based on the Cache-Aware [1] FastConformer [2] architecture with 24 encoder layers and an RNNT (Recurrent Neural Network Transducer) decoder. The cache-aware streaming design enables efficient processing of audio in chunks while maintaining context from previous frames. Unlike buffered inference, this model maintains caches for all encoder self-attention and convolution layers. This enables reuse of hidden states at every streaming step, so cached activations eliminate redundant computation. As a result, no computation is repeated across chunks: each frame is processed exactly once.

For the caching schema of self-attention and convolution layers across consecutive chunks, please refer to [1].
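The caching idea can be illustrated with a toy example. The sketch below is not NeMo's implementation: it uses a single causal 1D convolution in pure Python, and the function names (`causal_conv`, `stream_step`, `stream_all`) are hypothetical. It only shows that keeping a small cache of past inputs lets each chunk be processed once, without overlap, while producing the same output as processing the whole signal offline.

```python
from typing import List, Tuple


def causal_conv(signal: List[float], kernel: List[float]) -> List[float]:
    """Offline reference: output[t] depends on signal[t-K+1 .. t]."""
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(signal)
    return [
        sum(kernel[j] * padded[t + j] for j in range(k))
        for t in range(len(signal))
    ]


def stream_step(
    chunk: List[float], cache: List[float], kernel: List[float]
) -> Tuple[List[float], List[float]]:
    """Process one new chunk using the cached last K-1 inputs."""
    k = len(kernel)
    window = cache + list(chunk)
    out = [
        sum(kernel[j] * window[t + j] for j in range(k))
        for t in range(len(chunk))
    ]
    # Keep only the last K-1 inputs for the next step.
    new_cache = window[len(window) - (k - 1):]
    return out, new_cache


def stream_all(
    signal: List[float], kernel: List[float], chunk_size: int
) -> List[float]:
    """Feed the signal chunk by chunk; every sample is visited exactly once."""
    cache = [0.0] * (len(kernel) - 1)
    outputs: List[float] = []
    for start in range(0, len(signal), chunk_size):
        out, cache = stream_step(signal[start:start + chunk_size], cache, kernel)
        outputs.extend(out)
    return outputs


if __name__ == "__main__":
    sig = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
    ker = [1.0, 0.0, -1.0]
    print(stream_all(sig, ker, chunk_size=2) == causal_conv(sig, ker))
```

In the real model the cache holds hidden activations for every self-attention and convolution layer rather than raw samples, but the bookkeeping follows the same pattern: read the cache, process only the new chunk, write the updated cache.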

Network Architecture:

NVIDIA NeMo

To train, fine-tune, or run inference with this model, you will need to install NVIDIA NeMo [4]. We recommend installing it after Cython and the latest version of PyTorch.

How to Use this Model

The model is available for use in the NeMo Framework, and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

Loading the Model
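A minimal sketch of loading the checkpoint through NeMo's `ASRModel.from_pretrained`, assuming NeMo is installed and the checkpoint can be downloaded from Hugging Face. The helper names `load_model` and `transcribe_files` are hypothetical wrappers introduced here for illustration, and the shape of the value returned by `transcribe` may differ between NeMo versions.

```python
from typing import List

MODEL_NAME = "nvidia/nemotron-speech-streaming-en-0.6b"


def load_model(model_name: str = MODEL_NAME):
    """Download (if needed) and return the pre-trained ASR model."""
    # Imported lazily so this file can be imported without NeMo installed.
    import nemo.collections.asr as nemo_asr

    return nemo_asr.models.ASRModel.from_pretrained(model_name=model_name)


def transcribe_files(paths: List[str]):
    """Offline (non-streaming) transcription of 16 kHz mono audio files."""
    model = load_model()
    # Depending on the NeMo version, entries may be strings or hypothesis objects.
    return model.transcribe(paths)
```

Calling `transcribe_files(["sample.wav"])` would run batch transcription on a local file; `sample.wav` is a placeholder path.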

Streaming Inference

You can use the cache-aware streaming inference script from NeMo: NeMo/examples/asr/asr_cache_aware_streaming/speech_to_text_cache_aware_streaming_infer.py

You can also run streaming inference through the pipeline method, which uses the NeMo/examples/asr/conf/asr_streaming_inference/cache_aware_rnnt.yaml configuration file to build end‑to‑end workflows with punctuation and capitalization (PnC), inverse text normalization (ITN), and translation support.

Setting up Streaming Configuration

Latency is determined by the att_context_size parameter, where att_context_size = {num_frames_left_context, num_frame_right_context}, with both values measured in 80ms frames.

Here, chunk size = current frame + right context; each chunk is processed in non-overlapping fashion.
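The relationship between right context and chunk size can be worked out directly from the rule above. The following sketch assumes only what the text states: an 80ms frame and chunk size = current frame + right context. The function names are hypothetical, and the left-context value is left out because this text does not specify it.

```python
FRAME_MS = 80  # duration of one encoder frame, as stated above


def chunk_size_ms(right_context_frames: int) -> int:
    """Chunk duration = (current frame + right-context frames) * 80 ms."""
    if right_context_frames < 0:
        raise ValueError("right context must be non-negative")
    return (1 + right_context_frames) * FRAME_MS


def right_context_for(chunk_ms: int) -> int:
    """Invert the rule: how many right-context frames give this chunk size?"""
    if chunk_ms < FRAME_MS or chunk_ms % FRAME_MS != 0:
        raise ValueError("chunk size must be a positive multiple of 80 ms")
    return chunk_ms // FRAME_MS - 1


if __name__ == "__main__":
    # The four chunk sizes the model card lists as configurable.
    for ms in (80, 160, 560, 1120):
        print(f"{ms} ms chunk -> right context of {right_context_for(ms)} frame(s)")
```

For the four supported chunk sizes this gives right-context values of 0, 1, 6, and 13 frames respectively.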

Input

This model accepts single-channel (mono) audio sampled at 16,000 Hz. At least 80ms duration is required.
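Before sending audio to the model, it can help to check a file against these input requirements. The sketch below uses only Python's standard `wave` module and therefore handles uncompressed WAV files only. The helper name `check_wav` is hypothetical and is not part of NeMo.

```python
import wave
from typing import List

REQUIRED_RATE_HZ = 16_000  # sample rate stated in the model card
MIN_DURATION_MS = 80       # minimum duration stated in the model card


def check_wav(path: str) -> List[str]:
    """Return a list of problems; an empty list means the file looks usable."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        frames = wav.getnframes()

    problems: List[str] = []
    if channels != 1:
        problems.append(f"expected mono audio, found {channels} channels")
    if rate != REQUIRED_RATE_HZ:
        problems.append(f"expected {REQUIRED_RATE_HZ} Hz, found {rate} Hz")
    duration_ms = 1000 * frames / rate
    if duration_ms < MIN_DURATION_MS:
        problems.append(f"audio is {duration_ms:.0f} ms, need at least {MIN_DURATION_MS} ms")
    return problems
```

Files that fail the check would need to be downmixed or resampled with an audio tool of your choice before transcription.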

Output

The model outputs English text transcriptions with punctuation and capitalization. The output text may be empty if the input audio contains no speech.

Datasets

Training Datasets

The majority of the training data comes from the English portion of the Granary dataset [3]:

In addition, the following datasets were used:

Data Modality: Audio and text

Audio Training Data Size: 285k hours

Data Collection Method: Human - All audio is human-recorded

Labeling Method: Hybrid (Human, Synthetic) - Some transcripts are generated by ASR models, while some are manually labeled

Evaluation Datasets

The model was evaluated on the HuggingFace ASR Leaderboard datasets:

Performance

ASR Performance (w/o PnC)

ASR performance is measured using the Word Error Rate (WER). Both ground-truth and predicted texts are processed using whisper-normalizer version 0.1.12.
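For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, insertions, and deletions) between the reference and the prediction, divided by the number of reference words. The sketch below is a plain-Python illustration and not the evaluation code used for this model. It assumes both strings have already been normalized, so the whisper-normalizer step is not reproduced here, and the function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level Levenshtein distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because the distance is divided by the reference length, WER can exceed 1.0 when the hypothesis contains many inserted words.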

The following tables show the WER on the HuggingFace OpenASR leaderboard datasets:

Word Error Rate (WER) for chunk size of 1.12s

WER for chunk size of 0.56s

WER for chunk size of 0.16s

WER for chunk size of 0.08s

Software Integration

Runtime Engine: NeMo 25.11

Supported Hardware Microarchitecture Compatibility:

Test Hardware:

Preferred/Supported Operating System(s): Linux

Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.

References

[1] Stateful Conformer with Cache-based Inference for Streaming Automatic Speech Recognition

[2] Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition

[3] NVIDIA Granary

[4] NVIDIA NeMo Framework

