One Video, One World: Turning Monocular Video into Physical 4D Scenes

ECCV 2026

Junhao Chen¹,² Boran Zhang³ Mingjin Chen⁴ Henghaofan Zhang⁵
Saining Zhang⁶ Congcong Zhu³ Hao Zhao¹ Ruqi Huang¹
Zhihao Li² Yufei Wang²

¹Tsinghua University ²SparcAI Inc. ³University of Science and Technology of China
⁴The Hong Kong Polytechnic University ⁵University of Electronic Science and Technology of China
⁶Nanyang Technological University

**OVOW reconstructs instance-level, simulation-ready 4D mesh scenes from monocular video.** Given a single video, our method decomposes the scene into physically independent mesh instances and recovers both rigid-body motions and non-rigid mesh deformations. The resulting meshes are ready for downstream physics simulation and editing, as demonstrated on multi-object collisions, rigid-body motions, and deforming objects across tabletop, indoor, and in-the-wild scenarios.

Abstract

We introduce OVOW, the first training-free system that reconstructs instance-level, simulation-ready 4D mesh scenes from a single monocular video. Recent 4D reconstruction achieves impressive rendering quality, but its outputs (e.g., implicit fields, Gaussian primitives, or point clouds) lack the watertight topology, instance separation, and standardized physical interfaces required by physics simulators and embodied AI. OVOW closes this gap with a four-stage pipeline: a vision-language model discovers, labels, and motion-classifies all instances; category-aware reconstruction yields per-instance meshes for rigid objects and topology-consistent mesh sequences for deformable ones; an iterative render-match-optimize procedure recovers metric scale and 6-DoF pose trajectories; and physics-grounded assembly enforces ground contact and inter-object support. Crucially, we model all motion, rigid and non-rigid, through direct vertex deformation without category-specific priors or skeleton rigging, producing watertight mesh scenes ready for downstream physics simulation and editing. We further establish the first benchmark for structured Video-to-4D evaluation, with metrics for geometric correctness, instance separation, and physical plausibility beyond visual fidelity; the same pipeline doubles as a scalable engine for synthesizing paired video-to-4D simulation data for future 4D world models and embodied AI. Across two synthetic benchmarks (static and 4D), OVOW attains the best overall layout and geometry accuracy and the lowest photometric and semantic error among all baselines, and on monocular video runs one to two orders of magnitude faster than the baselines, while downstream physics simulation confirms its physical stability.

Overview

OVOW turns a monocular video into structured, instance-level 4D mesh assets that can be inspected as geometry and dropped directly into physics simulators.

Instance-Level Scenes

The scene is decomposed into physically independent, watertight mesh instances, each with a unique label and motion category.

4D Mesh Motion

Rigid and non-rigid motion is unified through 6-DoF pose trajectories and direct per-vertex deformation — no skeletons, no category priors.
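
One way to read this unification, as a minimal sketch rather than the authors' implementation: each vertex is first displaced in the object's local frame, then moved by that frame's rigid pose. The composition order and every name below (`pose_vertices`, `offsets`, `R_z90`) are assumptions made for illustration.

```python
from typing import List, Sequence

Vec3 = Sequence[float]

def pose_vertices(rest: List[Vec3], offsets: List[Vec3],
                  rotation: List[Vec3], translation: Vec3) -> List[List[float]]:
    """Apply local per-vertex offsets, then a rigid pose (R, t).

    Assumed model: x_i(t) = R(t) @ (v_i + d_i(t)) + t(t).
    A rigid object is the special case where every offset is zero.
    """
    posed = []
    for v, d in zip(rest, offsets):
        local = [v[k] + d[k] for k in range(3)]
        world = [sum(rotation[r][c] * local[c] for c in range(3)) + translation[r]
                 for r in range(3)]
        posed.append(world)
    return posed

# 90-degree rotation about z, then a shift of +1 along x.
R_z90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
rest = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
offsets = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.5]]  # only the second vertex deforms
posed = pose_vertices(rest, offsets, R_z90, [1.0, 0.0, 0.0])
```

Under this reading, a rigid instance and a deformable one share the same representation; they differ only in whether the offsets are zero.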

Simulation Ready

Physics-grounded assembly resolves ground contact and inter-object support, exporting coherent scenes (URDF) for direct use in physics engines.
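
For readers unfamiliar with URDF, the sketch below builds a single-link description for one mesh instance using only the Python standard library. It illustrates the file format, not OVOW's exporter: the function name `instance_to_urdf`, the mesh path, the mass, and the inertia values are all hypothetical placeholders.

```python
import xml.etree.ElementTree as ET

def instance_to_urdf(name: str, mesh_path: str, mass_kg: float) -> str:
    """Build a single-link URDF string for one mesh instance.

    mass_kg and the diagonal inertia are placeholder values (assumptions);
    a real export would derive them from the mesh volume and a density.
    """
    robot = ET.Element("robot", name=name)
    link = ET.SubElement(robot, "link", name=f"{name}_link")

    inertial = ET.SubElement(link, "inertial")
    ET.SubElement(inertial, "origin", xyz="0 0 0", rpy="0 0 0")
    ET.SubElement(inertial, "mass", value=str(mass_kg))
    ET.SubElement(inertial, "inertia", ixx="0.001", ixy="0", ixz="0",
                  iyy="0.001", iyz="0", izz="0.001")

    # The same mesh is used for rendering and for collision in this sketch.
    for tag in ("visual", "collision"):
        part = ET.SubElement(link, tag)
        geometry = ET.SubElement(part, "geometry")
        ET.SubElement(geometry, "mesh", filename=mesh_path)

    return ET.tostring(robot, encoding="unicode")

urdf_text = instance_to_urdf("toy_car", "meshes/toy_car.obj", mass_kg=0.2)
```

Watertight meshes matter here because simulators need a closed surface for collision geometry and for any volume-based mass estimate.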

Method

A fully training-free pipeline that composes pretrained foundation models into four stages, from scene understanding to a simulation-ready 4D mesh scene.

VLM-Guided Scene Decomposition

A vision-language model (Qwen3-VL) discovers, uniquely names, and motion-classifies every instance as static, rigid, or deformable; SAM3 produces dense per-frame masks.

Instance-Level Mesh Reconstruction

Amodal inpainting (FLUX.2) with a feed-forward image-to-3D model (Hi3DGen) handles static/rigid objects, while Motion324 yields topology-consistent mesh sequences for deformable ones; VGGT recovers metric scale.
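
The hand-off between the first two stages can be sketched as a small data structure plus a routing rule. This is an illustrative assumption about the interface, not the released code: `Instance`, `unique_names`, the numeric suffix scheme, and the branch labels are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List

class Motion(Enum):
    STATIC = "static"
    RIGID = "rigid"
    DEFORMABLE = "deformable"

@dataclass
class Instance:
    name: str
    motion: Motion

def unique_names(labels: List[str]) -> List[str]:
    """Suffix repeated labels so every instance name is unique (hypothetical scheme)."""
    totals = {label: labels.count(label) for label in labels}
    seen = {}
    names = []
    for label in labels:
        seen[label] = seen.get(label, 0) + 1
        names.append(f"{label}_{seen[label]}" if totals[label] > 1 else label)
    return names

def reconstruction_path(instance: Instance) -> str:
    """Route an instance to the reconstruction branch named in the text."""
    if instance.motion is Motion.DEFORMABLE:
        return "mesh_sequence"   # topology-consistent sequence for deformable objects
    return "single_mesh"         # static and rigid objects share one mesh

labels = ["cup", "cup", "table", "sea turtle"]
motions = [Motion.RIGID, Motion.STATIC, Motion.STATIC, Motion.DEFORMABLE]
scene = [Instance(n, m) for n, m in zip(unique_names(labels), motions)]
paths = {inst.name: reconstruction_path(inst) for inst in scene}
```

Unique names let the per-frame masks, the meshes, and the later pose trajectories all refer to the same instance.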

Spatiotemporal Pose & Deformation Recovery

An iterative render-match-optimize loop (RoMa v2 + FoundationPose) recovers metric scale, orientation, and per-frame 6-DoF pose trajectories, decoupling global motion from local vertex deformation.
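
The page does not spell out the optimizer, so the following is only a simplified stand-in for the scale part of this stage. It assumes the matching step has already produced paired 3D points (points on the reconstructed mesh and their counterparts in the scene) and recovers the scale factor in closed form; `estimate_scale` and the synthetic data are hypothetical.

```python
import math
from typing import List, Sequence

def centroid(points: List[Sequence[float]]) -> List[float]:
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(3)]

def estimate_scale(mesh_pts: List[Sequence[float]],
                   scene_pts: List[Sequence[float]]) -> float:
    """Scale s such that scene ~ s * R @ mesh + t for matched point pairs.

    Rotation and translation preserve distances to the centroid, so the
    ratio of RMS spreads gives s without solving for R or t first.
    """
    cm, cs = centroid(mesh_pts), centroid(scene_pts)
    spread_mesh = sum(sum((p[k] - cm[k]) ** 2 for k in range(3)) for p in mesh_pts)
    spread_scene = sum(sum((q[k] - cs[k]) ** 2 for k in range(3)) for q in scene_pts)
    return math.sqrt(spread_scene / spread_mesh)

# Synthetic check: scale by 2.5, rotate 90 degrees about z, translate.
mesh_pts = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]]
scene_pts = [[-2.5 * y + 4.0, 2.5 * x - 1.0, 2.5 * z + 0.5] for x, y, z in mesh_pts]
scale = estimate_scale(mesh_pts, scene_pts)
```

With noisy or partially wrong matches a single closed-form estimate is fragile, which is one plausible reason for repeating the render, match, and optimize steps.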

Physics-Grounded Scene Assembly

Ground-plane estimation and contact projection enforce ground contact and inter-object support, exporting a physically coherent, simulation-ready scene with recovered HDR environment lighting.
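
A minimal sketch of what contact projection onto the ground could look like, assuming the plane is given as a unit normal and offset and that each instance is moved only along that normal. The function names and the choice to align the lowest vertex are assumptions; the actual assembly also handles inter-object support, which this sketch omits.

```python
from typing import List, Sequence

def signed_height(point: Sequence[float], normal: Sequence[float], offset: float) -> float:
    """Signed distance to the plane n.x + offset = 0 (n must be unit length)."""
    return sum(point[k] * normal[k] for k in range(3)) + offset

def snap_to_ground(vertices: List[Sequence[float]],
                   normal: Sequence[float], offset: float) -> List[List[float]]:
    """Translate a mesh along the plane normal until its lowest vertex touches the plane."""
    gap = min(signed_height(v, normal, offset) for v in vertices)
    return [[v[k] - gap * normal[k] for k in range(3)] for v in vertices]

# Ground plane z = 0 with an object floating 0.3 units above it.
up = [0.0, 0.0, 1.0]
floating = [[0.0, 0.0, 0.3], [1.0, 0.0, 0.3], [0.0, 1.0, 1.3]]
grounded = snap_to_ground(floating, up, 0.0)
```

The same translation lifts an object that penetrates the ground, since the gap is then negative.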

Training-free

No task-specific training — pretrained foundation models only

3.35 s / frame

On monocular video — one to two orders of magnitude faster than baselines

82.7%

Reconstructed scenes stay stable under gravity in a physics engine

Video-to-4D Results

For each example: the input monocular video (left) and OVOW's reconstructed instance-level 4D mesh scene (right), across tabletop, indoor, and in-the-wild scenarios.

Scenes shown: Tabletop, Dragon, Sea Turtle, Polar Bear, Eagle, Toy Car, Airplane, Kitchen, Courtyard, Bedroom.

Interactive 3D / 4D Demo

Browser-native previews of OVOW outputs — instance-level meshes with recovered rigid and non-rigid motion.

Video-to-4D Reconstruction

Instance-level 4D meshes reconstructed from a single monocular video, with recovered motion.

Car

Rigid-body reconstruction with metric scale and per-frame pose from a monocular clip.

Sea Turtle

Deformable object: a topology-consistent mesh sequence with non-rigid vertex deformation.

Airplane

Object-level mesh recovered and tracked across the input sequence.

Image-to-3D Reconstruction

Instance-separated scene meshes recovered from a single in-the-wild image.

Scene Reconstruction I

Instance-separated meshes with faithful shape and accurate layout from one image.

Scene Reconstruction II

Watertight, simulation-ready geometry recovered via iterative scale-orientation refinement.

Scene Reconstruction III

Multi-object scene with correct spatial layout and per-instance separation.

Interactive viewers use GLB exports of OVOW reconstructions. Static objects are shown with auto-rotation; dynamic scenes play their recovered 4D motion on loop.

Simulation and Editing

OVOW's reconstructed, simulation-ready scenes are dropped into a physics engine, where instances rest, topple, and collide under gravity on the recovered ground plane, confirming physical stability.

Toy-Car Tabletop I

Toy-Car Tabletop II

Toy-Car Tabletop III

Triceratops Scene I

Triceratops Scene II

Dragon Scene

BibTeX

@misc{chen2026videoworldturningmonocular,
      title={One Video, One World: Turning Monocular Video into Physical 4D Scenes},
      author={Junhao Chen and Boran Zhang and Mingjin Chen and Henghaofan Zhang and Saining Zhang and Congcong Zhu and Hao Zhao and Ruqi Huang and Zhihao Li and Yufei Wang},
      year={2026},
      eprint={2606.31388},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2606.31388},
}