Zero-1-to-G: Taming Pretrained 2D Diffusion Model for Direct 3D Generation

Xuyi Meng1, Chen Wang1, Jiahui Lei1, Kostas Daniilidis1, Jiatao Gu2, Lingjie Liu1

1 University of Pennsylvania   2 Apple  

Abstract

Recent advances in 2D image generation have achieved remarkable quality, largely driven by the capacity of diffusion models and the availability of large-scale datasets. However, direct 3D generation is still constrained by the scarcity and lower fidelity of 3D datasets. In this paper, we introduce Zero-1-to-G, a novel approach that addresses this problem by enabling direct 3D generation on Gaussian splats through 2D diffusion models. Our key insight is that Gaussian splats, a 3D representation, can be decomposed into multi-view images encoding different attributes. This reframes the challenging task of direct 3D generation within a 2D diffusion framework, allowing us to leverage the rich priors of pretrained 2D diffusion models. To incorporate 3D awareness, we introduce cross-view and cross-attribute attention layers, which capture complex correlations and enforce 3D consistency across generated splats. This makes Zero-1-to-G the first direct 3D generative model to effectively utilize pretrained 2D diffusion priors, enabling efficient training and improved generalization to unseen objects. Extensive experiments on both synthetic and in-the-wild datasets demonstrate superior performance in 3D object generation, offering a new approach to high-quality 3D generation.
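To make the decomposition concrete, below is a minimal PyTorch sketch of the idea from the abstract: per-pixel Gaussian parameters are sliced into per-attribute "splatter images" across views, and an attention layer mixes tokens across both views and attributes. This is our own illustration, not the released implementation; the 14-channel attribute split, all tensor shapes, and the names splat_to_attribute_images and CrossViewAttrAttention are assumptions.

import torch
import torch.nn as nn

# Assumed 14-channel layout of a splatter image:
# RGB (3) + opacity (1) + scale (3) + rotation quaternion (4) + position (3).
ATTRIBUTES = {"rgb": 3, "opacity": 1, "scale": 3, "rotation": 4, "xyz": 3}

def splat_to_attribute_images(splats):
    """Slice a (B, V, 14, H, W) multi-view splatter tensor into a dict of
    per-attribute image stacks, each of shape (B, V, C_attr, H, W)."""
    out, start = {}, 0
    for name, ch in ATTRIBUTES.items():
        out[name] = splats[:, :, start:start + ch]
        start += ch
    return out

class CrossViewAttrAttention(nn.Module):
    """One way to realize cross-view and cross-attribute attention: flatten
    the view and attribute axes into a single token sequence so every token
    attends to every view and every attribute map."""

    def __init__(self, dim, heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feats):
        # feats: (B, V, A, N, D) = batch, views, attributes, tokens, channels
        B, V, A, N, D = feats.shape
        x = feats.reshape(B, V * A * N, D)
        h = self.norm(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x.reshape(B, V, A, N, D)

Flattening views and attributes into one joint sequence is only one possible design; since the abstract names cross-view and cross-attribute layers separately, the paper's layers may instead attend over the view axis and the attribute axis independently (reshaping to (B*A, V*N, D) and (B*V, A*N, D), respectively).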

Image-to-3D

Baseline Methods

Our method achieves the best fidelity among the compared methods. (LGM and InstantMesh are two-stage methods; LN3Diff and ours are single-stage.)

RGB & Normal

Our generated splatter images can be rendered directly into RGB images and normal maps simultaneously.

Diversity

Our method shows good diversity in its generated results.

Citation

@article{meng2025zero1tog,
  title={Zero-1-to-G: Taming Pretrained 2D Diffusion Model for Direct 3D Generation},
  author={Meng, Xuyi and Wang, Chen and Lei, Jiahui and Daniilidis, Kostas and Gu, Jiatao and Liu, Lingjie},
  journal={arXiv preprint arXiv:2501.05427},
  year={2025}
}