GaussianGPT — Forget Diffusion: TU Munich Just Taught a Transformer to Build 3D Scenes Token by Token

Nearly every 3D generation model you know uses diffusion. Shap-E, Point-E, the whole wave of text-to-3D systems built on 2D diffusion priors: all of them start from noise and iteratively refine. GaussianGPT from TU Munich just threw that playbook out and did something nobody expected: it generates 3D Gaussian scenes token by token, exactly like ChatGPT writes text.

GaussianGPT teaser showing unconditional scene generation, completion, and outpainting
GaussianGPT: unconditional generation, completion from partial inputs, and 12m × 12m outpainting. Source: arXiv 2603.26661 / TU Munich

The Story

Researchers at TU Munich's Visual Computing Lab (Nicolas von Lützow, Barbara Rössle, Katharina Schmid, and Matthias Nießner) published GaussianGPT in late March 2026. The core idea: compress a 3D Gaussian Splatting scene into discrete tokens via a sparse 3D convolutional autoencoder with vector quantization, then model those tokens with a GPT-2-scale causal transformer using 3D Rotary Positional Embeddings (3D RoPE). The model generates scenes voxel by voxel, predicting first where geometry is, then what it looks like.
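
The 3D RoPE piece is the most transferable idea here. Below is a minimal PyTorch sketch, assuming the common factorization that splits each attention head's channels evenly across the x, y, and z voxel axes and applies an ordinary 1D rotation per axis (the half-split RoPE variant); the even split and frequency base are illustrative assumptions, not confirmed details of the paper.

```python
import torch

def rope_1d(x: torch.Tensor, pos: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate features x of shape (..., seq, dim) by integer positions pos of shape (seq,)."""
    half = x.shape[-1] // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)  # (half,)
    angles = pos[:, None].float() * freqs[None, :]                     # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def rope_3d(x: torch.Tensor, coords: torch.Tensor) -> torch.Tensor:
    """x: (batch, seq, dim) queries or keys; coords: (seq, 3) integer voxel (x, y, z).
    Splits the channels into three groups and rotates each group by one axis."""
    dim = x.shape[-1]
    assert dim % 3 == 0, "head dim must split evenly across the three axes"
    chunks = x.split(dim // 3, dim=-1)
    return torch.cat([rope_1d(c, coords[:, ax]) for ax, c in enumerate(chunks)], dim=-1)

# Example: 8 voxel tokens in raster order on a 2x2x2 grid, 48-dim heads.
coords = torch.tensor([[i % 2, (i // 2) % 2, i // 4] for i in range(8)])
q = torch.randn(1, 8, 48)
print(rope_3d(q, coords).shape)  # torch.Size([1, 8, 48])
```

The payoff of rotary embeddings is that attention scores depend on relative voxel offsets rather than absolute positions, which is exactly what lets a sliding window reuse the same weights anywhere in a growing scene.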

This is fundamentally different from diffusion. Diffusion models start from pure noise and iteratively denoise the whole scene at once. GaussianGPT builds the scene up incrementally, one spatial chunk at a time. The consequence is a set of capabilities diffusion models struggle with:

  • Scene completion: Give it a quarter of a room, it fills in the rest.
  • Outpainting: Extend a 4m × 4m scene to 12m × 12m via a sliding context window.
  • Temperature-controlled sampling: Dial up creativity or stay faithful to context (see the sampling sketch after this list).
  • Variable generation length: Stop early for a rough sketch, run longer for detail.
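
Temperature-controlled sampling is the same knob every language model exposes, applied to a codebook of scene tokens. A minimal sketch, assuming generic logits over the token codebook; the codebook size below is illustrative, not GaussianGPT's actual vocabulary:

```python
import torch

def sample_token(logits: torch.Tensor, temperature: float = 1.0) -> int:
    """Sample one codebook index from unnormalized logits of shape (codebook_size,)."""
    if temperature <= 0:
        return int(logits.argmax())  # greedy decoding: always the most likely token
    probs = torch.softmax(logits / temperature, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))

logits = torch.randn(8192)        # illustrative codebook size, not the paper's
print(sample_token(logits, 0.7))  # low temperature: conservative, faithful to context
print(sample_token(logits, 1.3))  # high temperature: flatter distribution, more variety
```
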
GaussianGPT pipeline architecture diagram
The full GaussianGPT pipeline: Gaussian primitives → 3D voxel grid → discrete token codebook → causal GPT transformer. Source: arXiv 2603.26661 / TU Munich

Why You Should Care

The 3D generation space has been almost entirely owned by diffusion since 2022. GaussianGPT demonstrates there is a viable alternative paradigm — one that plays to the unique strengths of Gaussian Splatting. Diffusion generates holistically: the whole scene at once. Autoregressive generation is compositional — it builds piece by piece, which means you can pause and steer mid-scene, complete partially-observed spaces, or generate arbitrarily large environments via sliding window.
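
The sliding-window trick is simple enough to sketch. The code below grows a scene chunk by chunk, conditioning each new chunk on its already-generated neighbors; the `model` callable, grid layout, and window radius are assumptions for illustration, not the released implementation:

```python
from typing import Callable

def outpaint(seed: dict, model: Callable, grid: int = 3, window: int = 1) -> dict:
    """Grow a scene over a grid x grid chunk layout from a seed chunk.

    seed:   {(0, 0): tokens} mapping chunk coordinates to token sequences.
    model:  callable(context) -> tokens for the next chunk (assumed interface).
    window: radius of already-generated neighbor chunks to condition on.
    """
    scene = dict(seed)
    for gy in range(grid):
        for gx in range(grid):
            if (gx, gy) in scene:
                continue  # the seed chunk is already present
            # Keep only neighbors inside the sliding context window, so the
            # transformer's context stays bounded no matter how big the scene gets.
            context = [
                scene[(nx, ny)]
                for ny in range(gy - window, gy + window + 1)
                for nx in range(gx - window, gx + window + 1)
                if (nx, ny) in scene
            ]
            scene[(gx, gy)] = model(context)  # autoregressive continuation
    return scene

# With 4m chunks, a 3x3 grid is exactly the 4m -> 12m extension from the paper.
dummy = lambda ctx: f"tokens conditioned on {len(ctx)} chunks"
print(len(outpaint({(0, 0): "seed tokens"}, dummy)))  # 9 chunks = 12m x 12m
```

The point of the window is that compute and memory per chunk stay constant, so scene size is limited by patience rather than VRAM.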

For VFX artists, architects, and game developers, this distinction matters. You don’t want a model that hallucinates an entire environment from scratch — you want one that can logically extend what you’ve already built, or fill in gaps from a lidar scan. The outputs are native 3DGS scenes you can drop into Gaussian Splat Studio, SuperSplat, or Unreal directly. No conversion. No re-export.
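
"No conversion" works because 3DGS tools share a de facto .ply layout inherited from the original INRIA exporter. Here is a minimal NumPy sketch of that layout; it omits the higher-order spherical-harmonics fields (f_rest_*), which some importers tolerate and others require, so treat the field list as a starting point rather than a guaranteed spec:

```python
import numpy as np

def save_gaussians_ply(path, means, scales, rots, colors_dc, opacities):
    """Write n Gaussians in the common 3DGS .ply layout (float32, little-endian host assumed)."""
    n = means.shape[0]
    fields = [("x", "f4"), ("y", "f4"), ("z", "f4"),
              ("nx", "f4"), ("ny", "f4"), ("nz", "f4"),              # normals, written as zeros
              ("f_dc_0", "f4"), ("f_dc_1", "f4"), ("f_dc_2", "f4"),  # SH DC color coefficients
              ("opacity", "f4"),                                     # stored as a pre-sigmoid logit
              ("scale_0", "f4"), ("scale_1", "f4"), ("scale_2", "f4"),  # stored in log space
              ("rot_0", "f4"), ("rot_1", "f4"), ("rot_2", "f4"), ("rot_3", "f4")]  # quaternion
    data = np.zeros(n, dtype=fields)
    data["x"], data["y"], data["z"] = means.T
    data["f_dc_0"], data["f_dc_1"], data["f_dc_2"] = colors_dc.T
    data["opacity"] = opacities
    data["scale_0"], data["scale_1"], data["scale_2"] = scales.T
    data["rot_0"], data["rot_1"], data["rot_2"], data["rot_3"] = rots.T
    header = "\n".join(
        ["ply", "format binary_little_endian 1.0", f"element vertex {n}"]
        + [f"property float {name}" for name, _ in fields]
        + ["end_header", ""])
    with open(path, "wb") as f:
        f.write(header.encode("ascii"))
        f.write(data.tobytes())

# Illustrative values only: 100 random Gaussians, viewable in any 3DGS viewer.
n = 100
save_gaussians_ply("scene.ply",
                   means=np.random.randn(n, 3).astype(np.float32),
                   scales=np.full((n, 3), -4.0, np.float32),
                   rots=np.tile([1.0, 0.0, 0.0, 0.0], (n, 1)).astype(np.float32),
                   colors_dc=np.random.rand(n, 3).astype(np.float32),
                   opacities=np.full(n, 2.0, np.float32))
```
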

GaussianGPT large-scale outpainting extending 3D scenes to 12m × 12m
Large-scale outpainting: starting from a seed chunk, GaussianGPT generates coherent 12m × 12m environments via sliding window. Source: arXiv 2603.26661 / TU Munich
GaussianGPT scene completion from partial 3D Gaussian input
Scene completion: given one quarter of a room, GaussianGPT generates the rest with structural coherence across walls, floor, and furniture. Source: arXiv 2603.26661 / TU Munich

Try It / Follow Them

The project page demo videos are worth your time, especially the completion video where the model reconstructs a full room from a corner slice. Matthias Nießner is one of the most cited researchers in 3D reconstruction; his group does not publish vaporware.

IK3D Lab Take

The real shift here is conceptual. For three years, the question in 3D generation has been: “how do we make diffusion work better in 3D?” GaussianGPT asks a different question: what if we applied the architecture that made language models great? The fact that it works — and produces coherent results on indoor scenes, real-world ScanNet data, and large synthetic environments — suggests this is a genuine research direction, not just an academic exercise.

The code is coming soon. When it drops, this will be worth running if you have the GPU headroom. If you are building AI-augmented workflows around 3D scenes — in Blender, Unreal, or your own pipeline — this is the paper to bookmark this month. The completion and outpainting capabilities open up workflows that simply do not exist yet for working artists and developers. Watch this space.
