What it is. An image/video–centric world model that generates or predicts
photorealistic, temporally consistent frames (single- or multi-view) conditioned on past context and optional controls.
Typical inputs: past frames; camera poses with intrinsics/extrinsics; optional text, route, or action hints. A minimal input/output sketch follows the tags below.
Tags: Video synthesis · Future prediction · What-if simulation · Multi-view
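To make the conditioning concrete, here is a minimal Python sketch of the inputs and prediction signature such a model might expose. Everything in it is illustrative: the names (`VideoWorldModelInput`, `predict_future_frames`), the tensor shapes, and the placeholder body, which merely repeats the last frame, are assumptions rather than any specific model's API.

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np


@dataclass
class VideoWorldModelInput:
    past_frames: np.ndarray               # (T, H, W, 3) uint8 RGB context frames
    intrinsics: np.ndarray                # (3, 3) camera intrinsic matrix
    extrinsics: np.ndarray                # (T, 4, 4) camera-to-world poses, one per frame
    text_hint: Optional[str] = None       # optional text control, e.g. a route description
    actions: Optional[np.ndarray] = None  # (horizon, A) optional future action hints


def predict_future_frames(ctx: VideoWorldModelInput, horizon: int) -> np.ndarray:
    """Return (horizon, H, W, 3) predicted frames.

    Placeholder dynamics: repeat the last observed frame. A real model would
    run a learned, temporally consistent video generator conditioned on every
    field of `ctx`.
    """
    return np.repeat(ctx.past_frames[-1:], horizon, axis=0)


# Example: 4 context frames in, 6 predicted frames out.
ctx = VideoWorldModelInput(
    past_frames=np.zeros((4, 128, 256, 3), dtype=np.uint8),
    intrinsics=np.eye(3),
    extrinsics=np.tile(np.eye(4), (4, 1, 1)),
)
future = predict_future_frames(ctx, horizon=6)  # shape (6, 128, 256, 3)
```

Grouping the controls into a single input structure also makes "what-if" simulation explicit: rerunning the same context with a different `text_hint` or `actions` yields an alternative future.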
What it is. An occupancy-centric world model that represents scenes as 3D/4D occupancy
fields (e.g., voxel grids with semantics), enabling geometry-aware perception, forecasting, and simulation.
Typical inputs: multi-sensor cues (RGB, depth, events, LiDAR), ego motion, and maps. A minimal representation-and-rollout sketch follows the tags below.
Tags: 3D reconstruction · Occupancy forecasting · Autoregressive simulation · Semantic voxels
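A minimal sketch of the representation and an autoregressive rollout, assuming a dense semantic voxel grid stored as a NumPy array. `SemanticOccupancy`, `rollout_occupancy`, and the rigid-shift dynamics are hypothetical stand-ins for a learned forecaster.

```python
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class SemanticOccupancy:
    labels: np.ndarray    # (X, Y, Z) integer semantic class ids; 0 = free space
    voxel_size: float     # edge length of one voxel, in metres
    origin: np.ndarray    # (3,) world-frame position of voxel (0, 0, 0)


def rollout_occupancy(grid: SemanticOccupancy,
                      ego_motion: np.ndarray,   # (steps, 4, 4) per-step ego transforms
                      steps: int) -> List[np.ndarray]:
    """Autoregressively forecast `steps` future grids.

    Each iteration feeds the previous prediction back in as input. The
    dynamics here are a placeholder: the grid is rigidly shifted by the
    per-step ego translation, rounded to whole voxels. A learned model would
    also move other agents and fill newly revealed space.
    """
    preds = []
    labels = grid.labels
    for t in range(steps):
        shift = np.round(ego_motion[t, :3, 3] / grid.voxel_size).astype(int)
        labels = np.roll(labels, shift=tuple(shift), axis=(0, 1, 2))
        preds.append(labels.copy())
    return preds


# Example: a 200 x 200 x 16 grid rolled forward 5 steps at 2 m per step.
grid = SemanticOccupancy(labels=np.zeros((200, 200, 16), dtype=np.int64),
                         voxel_size=0.4, origin=np.zeros(3))
motion = np.tile(np.eye(4), (5, 1, 1))
motion[:, 0, 3] = 2.0
futures = rollout_occupancy(grid, motion, steps=5)  # five (200, 200, 16) grids
```

The feed-back loop is what makes the rollout autoregressive: step t+1 sees only the prediction from step t, so errors compound exactly as they would in closed-loop simulation.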
What it is. A point-cloud–centric world model that learns high-fidelity geometry and
dynamics directly from LiDAR sweeps, suitable for robust 3D understanding, data generation, and sensor-faithful simulation.
Typical inputs: past LiDAR sweeps, ego trajectory, and calibration; optional scene priors. A minimal trajectory-conditioned sketch follows the tags below.
Tags: Point-cloud synthesis · Future sweeps · Trajectory-conditioned · Physics-aware
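A minimal sketch of trajectory-conditioned sweep prediction, assuming (N, 3) NumPy point arrays and (4, 4) ego poses. `apply_rigid` and `predict_future_sweeps` are hypothetical names, and warping the latest sweep is a geometry-only placeholder: a learned model would generate new points and handle occlusion and moving objects.

```python
from typing import List

import numpy as np


def apply_rigid(points: np.ndarray, T: np.ndarray) -> np.ndarray:
    """Apply a (4, 4) rigid transform to an (N, 3) point array."""
    homo = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    return (homo @ T.T)[:, :3]


def predict_future_sweeps(past_sweeps: List[np.ndarray],
                          ego_poses: List[np.ndarray],
                          future_poses: List[np.ndarray]) -> List[np.ndarray]:
    """Predict one point cloud per future pose, conditioned on the trajectory.

    Placeholder: lift the latest sweep into the world frame using its ego
    pose, then re-express it in each future ego frame, so the output stays
    faithful to the sensor's coordinate conventions.
    """
    world = apply_rigid(past_sweeps[-1], ego_poses[-1])                  # ego -> world
    return [apply_rigid(world, np.linalg.inv(T)) for T in future_poses]  # world -> future ego


# Example: one sweep, ego moves 5 m forward before the next prediction.
sweep = np.random.rand(1000, 3) * 50.0
pose = np.eye(4)
future_pose = np.eye(4)
future_pose[0, 3] = 5.0
preds = predict_future_sweeps([sweep], [pose], [future_pose])  # list of (1000, 3)
```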