3D and 4D World Modeling: A Survey

Introduction

This survey reviews 3D and 4D world models: models that learn, predict, and simulate the geometry and dynamics of real environments from multi-modal signals. We unify terminology, scope, and evaluation, and organize the field into three complementary paradigms defined by their underlying representation: VideoGen (image/video-centric), OccGen (occupancy-centric), and LiDARGen (point-cloud-centric).

Definition

VideoGen

What it is. An image/video–centric world model that generates or predicts photorealistic, temporally consistent frames (single- or multi-view) conditioned on past context and optional controls.

Typical inputs: past frames; camera poses with intrinsics and extrinsics; optional text, route, or action hints.

Typical tasks: video synthesis, future prediction, what-if simulation, multi-view generation.
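
To make the conditioning interface concrete, here is a minimal sketch of a VideoGen-style next-frame predictor in PyTorch. The class name, toy convolutional backbone, tensor shapes, and the 6-DoF action vector are illustrative assumptions, not the architecture of any surveyed method.

```python
import torch
import torch.nn as nn

class VideoWorldModel(nn.Module):
    """Toy next-frame predictor conditioned on past frames and an action."""

    def __init__(self, channels: int = 3, context: int = 4, hidden: int = 64):
        super().__init__()
        self.context = context
        # Encode the stacked context frames; decode a single future frame.
        self.encoder = nn.Conv2d(channels * context, hidden, 3, padding=1)
        self.cond_proj = nn.Linear(6, 16)   # hypothetical 6-DoF ego action
        self.decoder = nn.Conv2d(hidden + 16, channels, 3, padding=1)

    def forward(self, past: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        # past: (B, T, C, H, W) with T == self.context; action: (B, 6)
        b, t, c, h, w = past.shape
        feat = torch.relu(self.encoder(past.reshape(b, t * c, h, w)))
        # Broadcast the action embedding over the spatial grid.
        cond = self.cond_proj(action).view(b, -1, 1, 1).expand(-1, -1, h, w)
        return self.decoder(torch.cat([feat, cond], dim=1))   # (B, C, H, W)

model = VideoWorldModel()
frames = torch.rand(1, 4, 3, 64, 64)   # four past RGB frames
action = torch.zeros(1, 6)             # e.g. "hold course"
print(model(frames, action).shape)     # torch.Size([1, 3, 64, 64])
```

Rolling such a predictor forward autoregressively, feeding each predicted frame back into the context window, is what turns a one-step model into a what-if simulator.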

OccGen

What it is. An occupancy-centric world model that represents scenes as 3D/4D occupancy fields (e.g., voxel grids with semantics), enabling geometry-aware perception, forecasting, and simulation.

Typical inputs: multi-sensor cues (RGB, depth, events, LiDAR), ego motion, maps.

Typical tasks: 3D reconstruction, occupancy forecasting, autoregressive simulation, semantic voxel generation.
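
The sketch below builds the semantic voxel grid that occupancy-centric methods operate on, here by direct voxelization of labelled points. The grid extents, 0.5 m resolution, and label set are illustrative assumptions; real OccGen pipelines predict such grids with learned networks rather than filling them geometrically.

```python
import numpy as np

VOXEL_SIZE = 0.5                     # metres per voxel
GRID_RANGE = np.array([[-40.0, 40.0], [-40.0, 40.0], [-3.0, 5.0]])  # x, y, z
GRID_SHAPE = ((GRID_RANGE[:, 1] - GRID_RANGE[:, 0]) / VOXEL_SIZE).astype(int)
FREE = 0                             # label 0 = free space; >0 = semantic class

def voxelize(points: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Fill an (X, Y, Z) integer grid with per-point semantic labels."""
    grid = np.full(GRID_SHAPE, FREE, dtype=np.int32)
    idx = ((points - GRID_RANGE[:, 0]) / VOXEL_SIZE).astype(int)
    valid = np.all((idx >= 0) & (idx < GRID_SHAPE), axis=1)
    idx, labels = idx[valid], labels[valid]
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = labels
    return grid

# Toy scene: labelled points (class 1 = "vehicle", class 2 = "ground").
pts = np.array([[5.0, 2.0, 0.0], [5.5, 2.0, 0.5], [0.0, 0.0, -1.5]])
lab = np.array([1, 1, 2])
occ = voxelize(pts, lab)
print(occ.shape, np.count_nonzero(occ))   # (160, 160, 16) 3
```

A 4D occupancy sequence is simply a stack of such grids over time, which is the structure occupancy forecasting and autoregressive simulation operate on.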

LiDARGen

What it is. A point-cloud–centric world model that learns high-fidelity geometry and dynamics directly from LiDAR sweeps, suitable for robust 3D understanding, data generation, and sensor-faithful simulation.

Typical inputs: past LiDAR sweeps, ego trajectory, calibration; optional scene priors.

Typical tasks: point-cloud synthesis, future-sweep prediction, trajectory-conditioned generation, physics-aware simulation.
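
Many LiDAR generative models first project each sweep into a range image so that image-style generative machinery applies. Below is a minimal sketch of that projection; the beam count, azimuth resolution, and vertical field of view are hypothetical sensor parameters, not values from any surveyed method.

```python
import numpy as np

H, W = 32, 1024                  # beams x azimuth bins (hypothetical sensor)
FOV_UP, FOV_DOWN = 10.0, -30.0   # vertical field of view in degrees

def to_range_image(points: np.ndarray) -> np.ndarray:
    """Project an (N, 3) sweep to an (H, W) range image; empty pixels are 0."""
    x, y, z = points.T
    r = np.linalg.norm(points, axis=1)
    yaw = np.arctan2(y, x)                                   # [-pi, pi]
    pitch = np.arcsin(np.clip(z / np.maximum(r, 1e-6), -1.0, 1.0))
    u = ((0.5 * (1.0 - yaw / np.pi)) * W).astype(int) % W    # azimuth bin
    fov = np.radians(FOV_UP - FOV_DOWN)
    v = ((np.radians(FOV_UP) - pitch) / fov * H).astype(int)  # beam row
    img = np.zeros((H, W), dtype=np.float32)
    keep = (v >= 0) & (v < H)          # drop points outside the vertical FOV
    img[v[keep], u[keep]] = r[keep]
    return img

sweep = np.random.uniform(-50, 50, size=(2048, 3))   # toy LiDAR sweep
print(to_range_image(sweep).shape)                    # (32, 1024)
```

Generating future range images conditioned on the ego trajectory, then back-projecting them to 3D, is one common recipe for trajectory-conditioned future-sweep prediction.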

Examples

VideoGen

OccGen

LiDARGen

Projects

Video Generation

Occupancy Generation

LiDAR Generation
