GenManip: LLM-driven Simulation for Generalizable Instruction-Following Manipulation

[Figure: GenManip teaser]

Abstract

Robotic manipulation in real-world settings remains challenging, especially with respect to robust generalization. Existing simulation platforms offer too little support for studying how policies adapt to varied instructions and scenarios. They therefore lag behind the growing interest in instruction-following foundation models such as LLMs, whose adaptability is crucial yet remains underexplored in fair comparisons. To bridge this gap, we introduce GenManip, a realistic tabletop simulation platform tailored for policy generalization studies. It features an automatic pipeline that uses a GPT-driven task-oriented scene graph to synthesize large-scale, diverse tasks from 10K annotated 3D object assets. To systematically assess generalization, we present GenManip-Bench, a benchmark of 200 scenarios refined via human-in-the-loop corrections. We evaluate two policy types: (1) modular manipulation systems that integrate foundation models for perception, reasoning, and planning, and (2) end-to-end policies trained through scalable data collection. Results show that while data scaling benefits end-to-end methods, modular systems enhanced with foundation models generalize more effectively across diverse scenarios. We expect this platform to yield critical insights for advancing policy generalization in realistic conditions.

Leaderboard
| Modular method | MLLM | Spatial SR | Spatial SPL | Appearance SR | Appearance SPL | Common Sense SR | Common Sense SPL | Long-Horizon SR | Long-Horizon SPL | Overall SR | Overall SPL |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CoPA | GPT-4o | 25.00 | 4.57 | 29.41 | 7.12 | 20.59 | 4.19 | 17.59 | 4.26 | 20.33 | 4.53 |
| CoPA | GPT-4.5 | 25.00 | 5.44 | 26.47 | 4.53 | 29.41 | 7.64 | 11.11 | 2.97 | 23.00 | 5.26 |
| CoPA | Claude-3.7-Sonnet | 8.33 | 0.59 | 14.71 | 2.15 | 2.94 | 0.26 | 7.41 | 1.08 | 8.67 | 1.15 |
| CoPA | Gemini-2.0-Flash | 8.33 | 1.81 | 38.24 | 7.83 | 17.65 | 4.12 | 8.33 | 1.76 | 19.00 | 4.02 |
| CoPA | Qwen2.5-VL-72B | 8.33 | 2.01 | 23.53 | 5.60 | 10.78 | 1.96 | 12.96 | 2.72 | 13.67 | 3.05 |
| MOKA | GPT-4o | 8.33 | 0.50 | 11.76 | 1.28 | 11.76 | 2.15 | 5.56 | 0.63 | 10.00 | 1.29 |
| MOKA | GPT-4.5 | 16.67 | 3.75 | 23.53 | 2.20 | 17.65 | 3.26 | 11.11 | 2.25 | 14.00 | 2.14 |
| MOKA | Claude-3.7-Sonnet | 0.00 | 0.00 | 0.00 | 0.00 | 20.59 | 6.02 | 8.33 | 1.12 | 7.00 | 2.05 |
| MOKA | Gemini-2.0-Flash | 8.33 | 0.72 | 29.41 | 7.31 | 0.00 | 0.00 | 0.00 | 0.00 | 12.00 | 2.66 |
| MOKA | Qwen2.5-VL-72B | 8.33 | 1.40 | 17.65 | 3.73 | 8.82 | 0.70 | 8.33 | 1.64 | 11.00 | 1.84 |

† CoPA and MOKA are reproduced within GenManip with adaptations.
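
The page does not define the SR and SPL columns. As a rough illustration only, the sketch below assumes SR is the percentage of successful episodes and SPL follows the common "success weighted by path length" definition from embodied navigation, where each successful episode is weighted by the ratio of a reference path length to the path actually executed. The `Episode` class, its field names, and the sample values are hypothetical and are not taken from GenManip.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Episode:
    """One evaluation rollout (hypothetical structure, not GenManip's API)."""
    success: bool
    reference_path: float  # assumed: length of a reference (e.g. expert) path, > 0
    actual_path: float     # assumed: length of the path the policy executed


def success_rate(episodes: List[Episode]) -> float:
    """Percentage of episodes that succeeded."""
    return 100.0 * sum(e.success for e in episodes) / len(episodes)


def spl(episodes: List[Episode]) -> float:
    """Success weighted by path length, as a percentage.

    Failed episodes contribute 0; successful ones contribute
    reference_path / max(actual_path, reference_path).
    """
    total = 0.0
    for e in episodes:
        if e.success:
            total += e.reference_path / max(e.actual_path, e.reference_path)
    return 100.0 * total / len(episodes)


# Made-up episodes, for illustration only.
episodes = [
    Episode(success=True, reference_path=1.0, actual_path=2.0),
    Episode(success=False, reference_path=1.0, actual_path=3.0),
    Episode(success=True, reference_path=1.0, actual_path=1.0),
    Episode(success=False, reference_path=1.0, actual_path=1.5),
]
print(success_rate(episodes))  # 50.0
print(spl(episodes))           # 37.5
```

Under this assumed definition SPL can never exceed SR, which is consistent with every row of the leaderboard above.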

Analysis Results

Success Rate Analysis

Performance Comparison: GR-1 vs ACT

[Figure: success rate (%) versus data size (100, 200, 300, 500, 1000) for GR-1 and ACT. Plotted values, in listed order: 33.0%, 78.0%, 84.5%, 91.0%, 95.0% for one method and 40.0%, 60.0%, 66.5%, 67.0%, 72.5% for the other.]

Ablation Studies Analysis

End-to-End Method Generalization

[Figure: success rate (0 to 100) by method. Labels, in listed order: Limited (fg), Limited (all), Full (fg), Full (all), Full Experiment, Unseen Instruction, Unseen Objects. Values, in listed order: 78.2±19.5, 57.0±8.4, 45.5±10.1, 43.5±26.1, 8.5±5.5, 0.0±1.2, 0.0±0.0.]