EarthCrafter: Scalable 3D Earth Generation via Dual-Sparse Latent Diffusion

1Alibaba DAMO Academy, 2Hupan Lab, 3Fudan University

Abstract

Despite the remarkable developments achieved by recent 3D generation works, scaling these methods to geographic extents, such as modeling thousands of square kilometers of Earth’s surface, remains an open challenge. We address this through a dual innovation in data infrastructure and model architecture. First, we introduce Aerial-Earth3D, the largest 3D aerial dataset to date, consisting of 50k curated scenes (each measuring 600m$\times$600m) captured across the U.S. mainland, comprising 45M multi-view Google Earth frames. Each scene provides multi-view images with camera poses, depth maps, normals, and semantic segmentation, with explicit quality control to ensure terrain diversity. Building on this foundation, we propose EarthCrafter, a tailored framework for large-scale 3D Earth generation via sparse-decoupled latent diffusion. Our architecture separates structural and textural generation: 1) Dual sparse 3D-VAEs compress high-resolution geometric voxels and textural 2D Gaussian Splats (2DGS) into compact latent spaces, largely alleviating the prohibitive computational cost of vast geographic scales while preserving critical information. 2) We propose condition-aware flow matching models trained on mixed inputs (semantics, images, or neither) to flexibly model latent geometry and texture features independently. Extensive experiments demonstrate that EarthCrafter substantially outperforms prior methods in extremely large-scale generation. The framework further supports versatile applications, from semantic-guided urban layout generation to unconditional terrain synthesis, while maintaining geographic plausibility through our rich data priors from Aerial-Earth3D.

Aerial-Earth3D Dataset

We present Aerial-Earth3D, the largest 3D aerial dataset created to date. This dataset comprises 50,028 meticulously curated scenes, each spanning 600m × 600m, sourced across the mainland U.S. with 45 million multi-view frames captured from Google Earth. To effectively cover valid and diverse regions with limited viewpoints, we carefully design heuristic camera poses based on simulated 3D scenes built upon DEM, OSM, and MS-Building datasets. Since Google Earth does not provide source meshes, we reconstruct 3D meshes via InstantNGP, applying several post-processing techniques to extract surface planes, fix normals, and refine mesh connectivity. These meshes are then voxelized as the ground truth for structural generation. Additionally, we employ AIE-SEG to create semantic maps as mesh attributes, comprising 25 distinct classes. Aerial-Earth3D stands out as a large-scale 3D aerial dataset characterized by its diverse terrains and 3D annotations, significantly advancing both 3D generation and reconstruction efforts.
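The voxelization step above turns each reconstructed mesh into an occupancy grid that serves as structural ground truth. A minimal sketch of such a step, assuming a simple point-sampling approach (sample points uniformly on each triangle, then quantize them into a grid); the function name and parameters are illustrative, not the dataset's actual pipeline:

```python
import numpy as np

def voxelize_mesh(vertices, faces, grid_res=64, samples_per_face=256, seed=0):
    """Rasterize a triangle mesh into a binary occupancy grid by
    uniformly sampling points on each face and quantizing them.
    Illustrative stand-in for the dataset's voxelization step."""
    rng = np.random.default_rng(seed)
    tris = vertices[faces]                                   # (F, 3, 3)
    # Uniform barycentric sampling: fold (u, v) back into the triangle.
    u = rng.random((len(faces), samples_per_face, 1))
    v = rng.random((len(faces), samples_per_face, 1))
    flip = (u + v) > 1.0
    u, v = np.where(flip, 1.0 - u, u), np.where(flip, 1.0 - v, v)
    pts = (1 - u - v) * tris[:, None, 0] + u * tris[:, None, 1] + v * tris[:, None, 2]
    pts = pts.reshape(-1, 3)
    # Normalize the point cloud to the unit cube, then quantize to voxel indices.
    lo, hi = pts.min(0), pts.max(0)
    idx = ((pts - lo) / np.maximum(hi - lo, 1e-8) * (grid_res - 1)).astype(int)
    grid = np.zeros((grid_res,) * 3, dtype=bool)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return grid

# A flat unit square (two triangles) occupies only the z = 0 voxel layer.
verts = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0]], dtype=float)
faces = np.array([[0, 1, 2], [0, 2, 3]])
occ = voxelize_mesh(verts, faces, grid_res=16)
```

In practice a surface-aware method (e.g., exact triangle–voxel intersection) gives watertight results; point sampling is the shortest correct illustration.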
Data collection of Google Earth.
The overall data pipeline of Aerial-Earth3D.

Attributes visualization of Aerial-Earth3D Dataset

Aerial-Earth3D Dataset visualization

Framework

EarthCrafter separately models texture and structure in the latent spaces compressed by TexVAE and StructVAE, as illustrated in (a) and (b), respectively. EarthCrafter also contains textural and structural flow-matching models, i.e., TexFM and StructFM, to model the corresponding latent representations. We show the overall pipeline of EarthCrafter in (c), where dashed boxes denote optional conditions.
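At inference, a flow-matching model is sampled by integrating its learned velocity field from noise, and the pipeline cascades structure before texture. A minimal sketch under stated assumptions: `fm_sample` is a plain Euler ODE integrator, the `cond=None` path mirrors the mixed-condition training described above, and the two lambda "velocity fields" are toy stand-ins for StructFM and TexFM, not the actual networks:

```python
import numpy as np

def fm_sample(velocity, x0, cond=None, steps=50):
    """Euler integration of a flow-matching ODE from noise x0 at t=0
    to a latent sample at t=1. `velocity(x, t, cond)` is the learned
    field; `cond` may be None for unconditional generation."""
    x = x0.copy()
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * velocity(x, t, cond)
    return x

rng = np.random.default_rng(0)

# Hypothetical cascade mirroring panel (c): structure first, then texture.
struct_vel = lambda x, t, cond: -x          # toy stand-in for StructFM
tex_vel = lambda x, t, cond: cond - x       # toy stand-in for TexFM

z_struct = fm_sample(struct_vel, rng.standard_normal(8))           # structural latent
z_tex = fm_sample(tex_vel, rng.standard_normal(8), cond=z_struct)  # texture latent, conditioned on structure
```

In the real system each latent would then pass through the matching VAE decoder (StructVAE to voxels, TexVAE to 2DGS); the sketch only shows the sampling loop and the structure-to-texture conditioning order.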
Overall pipeline of EarthCrafter
Network structure of StructVAE.                                      Network structure of StructFM.
Network structure of TexFM.

Infinite scene generation under semantic conditions

Note: To validate the ability of infinite scene generation, we obtain a large semantic map of size 748 × 748 from the source scene mesh of a validation patch, which overlaps with the training patches (256 × 256). In each figure of this section, the left image is the input semantic condition, the middle image is rendered from the generated scene voxels, and the right video is rendered from the generated 2DGS scene.
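Covering a 748 × 748 map with 256 × 256 generation windows requires overlapping tiles whose last window is clamped to the boundary. A small sketch of that tiling logic, assuming a fixed stride (the stride value is illustrative; the paper's note only fixes the 748 and 256 sizes):

```python
def tile_positions(extent, patch, stride):
    """Start offsets of overlapping patches covering [0, extent),
    clamping the final window so it ends exactly at the boundary.
    Illustrative sketch of splitting a large semantic map into
    windows for patch-wise generation."""
    starts = list(range(0, max(extent - patch, 0) + 1, stride))
    if starts[-1] + patch < extent:
        starts.append(extent - patch)  # clamp last window to the edge
    return starts

# 748-wide map, 256-wide patches, 192-pixel stride (64-pixel overlap).
xs = tile_positions(748, 256, 192)
```

The same offsets are reused along both axes, and the overlap region is where neighboring patches can be blended for a seamless large scene.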

Overall scene generation with semantic conditions

Overall scene generation with RGBD conditions

Overall scene generation without conditions

Diverse generation under the same semantic condition

Diverse texture generation under the same semantic condition

Diverse texture generation with empty condition