Nemotron-TwoTower: Diffusion Language Modeling with Pretrained Autoregressive Context

2606.26493

Authors

Mohammad Shoeybi, Bryan Catanzaro, Fitsum Reda, John Kamalu, Roger Waleffe

Abstract

Diffusion language models offer a promising alternative to autoregressive (AR) models because they can generate tokens in parallel and refine them iteratively. However, existing approaches use a single network for both context representation and iterative denoising, which limits the capacity available for either role.

We propose TwoTower, a block-wise autoregressive diffusion model that decouples these roles into two towers: a frozen AR context tower that causally processes clean tokens, and a trainable diffusion denoiser tower with bidirectional block attention that refines noisy blocks via cross-attention to the context. Built on Nemotron-3-Nano-30B-A3B, an open-weight 30B hybrid Mamba-Transformer MoE model, and trained on approximately 2.1T tokens, Nemotron-TwoTower retains 98.7% of the autoregressive baseline's quality while offering 2.42X higher wall-clock generation throughput.
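
The attention pattern described above can be illustrated with a small sketch. It is not the released implementation: the function names, the boolean-mask representation, and the rule that a noisy block sees only clean tokens from strictly earlier blocks are assumptions made for illustration, since the abstract does not specify the exact masking.

```python
# Illustrative sketch only: names and masking rules are assumptions,
# not taken from the Nemotron-TwoTower code release.

def block_ids(seq_len: int, block_size: int) -> list[int]:
    """Map each token position to the index of the block containing it."""
    return [pos // block_size for pos in range(seq_len)]


def denoiser_self_mask(seq_len: int, block_size: int) -> list[list[bool]]:
    """Bidirectional block attention: mask[q][k] is True when noisy
    positions q and k lie in the same block."""
    ids = block_ids(seq_len, block_size)
    return [[q_block == k_block for k_block in ids] for q_block in ids]


def context_cross_mask(seq_len: int, block_size: int) -> list[list[bool]]:
    """Cross-attention from noisy position q (row) to clean context
    position k (column). Assumed rule: only strictly earlier blocks
    are visible, so generation stays block-wise autoregressive."""
    ids = block_ids(seq_len, block_size)
    return [[k_block < q_block for k_block in ids] for q_block in ids]


if __name__ == "__main__":
    for name, mask in [
        ("self", denoiser_self_mask(6, 2)),
        ("cross", context_cross_mask(6, 2)),
    ]:
        print(name)
        for row in mask:
            print("".join("1" if allowed else "0" for allowed in row))
```

Under these assumptions, tokens inside a block attend to each other freely while being denoised, and everything they know about earlier text comes through cross-attention to the frozen context tower.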

We release the code and model weights at https://huggingface.co/collections/nvidia/nemotron-twotower.
