OVOW turns a monocular video into structured, instance-level 4D mesh assets that can be inspected as geometry and dropped directly into physics simulators.
The scene is decomposed into physically independent, watertight mesh instances, each with a unique label and motion category.
Rigid and non-rigid motion is unified through 6-DoF pose trajectories and direct per-vertex deformation — no skeletons, no category priors.
Physics-grounded assembly resolves ground contact and inter-object support, exporting coherent scenes (URDF) for direct use in physics engines.
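To make the unified motion representation above concrete, here is a minimal sketch of one way a per-frame 6-DoF pose and a per-vertex deformation could be composed. It is an illustration under stated assumptions, not OVOW's actual code: the function names, the `R (v + d) + t` composition order, and the plain-list data layout are all hypothetical. A rigid instance would simply carry all-zero offsets.

```python
from typing import List, Sequence

Vec3 = Sequence[float]
Mat3 = Sequence[Sequence[float]]


def pose_vertex(rotation: Mat3, translation: Vec3, point: Vec3) -> List[float]:
    """Apply the rigid transform x -> R x + t to a single 3D point."""
    return [
        sum(rotation[i][k] * point[k] for k in range(3)) + translation[i]
        for i in range(3)
    ]


def frame_vertices(
    rest_vertices: Sequence[Vec3],
    offsets: Sequence[Vec3],
    rotation: Mat3,
    translation: Vec3,
) -> List[List[float]]:
    """World-space vertices for one frame, assuming x = R (v + d) + t.

    rest_vertices: canonical mesh vertices in the object's local frame.
    offsets:       hypothetical per-vertex local deformation for this frame
                   (all zeros for a static or rigid instance).
    rotation, translation: the frame's 6-DoF pose (global motion).
    """
    world = []
    for vertex, offset in zip(rest_vertices, offsets):
        # Local deformation is applied first, in the object frame.
        local = [vertex[k] + offset[k] for k in range(3)]
        # The rigid pose then carries the deformed vertex into the world.
        world.append(pose_vertex(rotation, translation, local))
    return world
```

Keeping the offsets in the object frame is what lets global motion and local deformation be stored and edited separately; because vertex count and order never change, the mesh topology stays fixed across frames.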
OVOW is a fully training-free pipeline that composes pretrained foundation models into four stages, from scene understanding to a simulation-ready 4D mesh scene.
A vision-language model (Qwen3-VL) discovers, uniquely names, and motion-classifies every instance as static, rigid, or deformable; SAM3 produces dense per-frame masks.
Amodal inpainting (FLUX.2) with a feed-forward image-to-3D model (Hi3DGen) handles static/rigid objects, while Motion324 yields topology-consistent mesh sequences for deformable ones; VGGT recovers metric scale.
An iterative render-match-optimize loop (RoMa v2 + FoundationPose) recovers metric scale, orientation, and per-frame 6-DoF pose trajectories, decoupling global motion from local vertex deformation.
Ground-plane estimation and contact projection enforce ground contact and inter-object support, exporting a physically coherent, simulation-ready scene with recovered HDR environment lighting.
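As a rough illustration of the contact-projection idea in the last stage, the sketch below translates an instance along the ground normal until its lowest vertex touches the plane. It assumes the ground plane is already known in the form `n · x + offset = 0`; the function names and this simple translate-only strategy are assumptions made for illustration and may differ from what OVOW actually does, particularly for inter-object support.

```python
import math
from typing import List, Sequence, Tuple

Vec3 = Sequence[float]


def signed_distance(point: Vec3, unit_normal: Vec3, offset: float) -> float:
    """Signed distance from a point to the plane n . x + offset = 0 (unit n)."""
    return sum(n * p for n, p in zip(unit_normal, point)) + offset


def snap_to_ground(
    vertices: Sequence[Vec3], normal: Vec3, offset: float
) -> Tuple[List[List[float]], float]:
    """Translate a mesh along the plane normal so its lowest vertex touches the plane.

    Returns the shifted vertices and the gap that was closed: positive if the
    object was floating above the plane, negative if it was penetrating it.
    """
    # Normalise the plane so distances are in the same units as the mesh.
    length = math.sqrt(sum(n * n for n in normal))
    unit = [n / length for n in normal]
    scaled_offset = offset / length

    # The lowest vertex determines how far the object floats or penetrates.
    gap = min(signed_distance(v, unit, scaled_offset) for v in vertices)

    # Move every vertex by the same amount, so the shape is unchanged.
    shift = [-gap * u for u in unit]
    moved = [[v[k] + shift[k] for k in range(3)] for v in vertices]
    return moved, gap
```

For example, calling `snap_to_ground` on a box hovering slightly above the plane `z = 0` would lower it until its bottom face rests on the plane, while a box sunk below the plane would be lifted by the same rule.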
For each example: the input monocular video (left) and OVOW's reconstructed instance-level 4D mesh scene (right), across tabletop, indoor, and in-the-wild scenarios.
Browser-native previews of OVOW outputs — instance-level meshes with recovered rigid and non-rigid motion.
Instance-level 4D meshes reconstructed from a single monocular video, with recovered motion.
Instance-separated scene meshes recovered from a single in-the-wild image.
Interactive viewers use GLB exports of OVOW reconstructions. Static objects are shown with auto-rotation; dynamic scenes play their recovered 4D motion on loop.
OVOW's reconstructed, simulation-ready scenes dropped into a physics engine — instances rest, topple, and collide under gravity on the recovered ground plane, confirming physical stability.
@misc{chen2026videoworldturningmonocular,
  title={One Video, One World: Turning Monocular Video into Physical 4D Scenes},
  author={Junhao Chen and Boran Zhang and Mingjin Chen and Henghaofan Zhang and Saining Zhang and Congcong Zhu and Hao Zhao and Ruqi Huang and Zhihao Li and Yufei Wang},
  year={2026},
  eprint={2606.31388},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2606.31388},
}