Multi-modal Situated Reasoning in 3D Scenes

¹Beijing Institute for General Artificial Intelligence, ²Peking University
* indicates equal contribution

Abstract

Situation awareness is essential for embodied AI agents to understand and reason about 3D scenes. However, existing datasets and benchmarks for situated understanding suffer from severe limitations in data modality, scope, diversity, and scale.

To address these limitations, we propose Multi-modal Situated Question Answering (MSQA), a large-scale multi-modal situated reasoning dataset collected at scale by leveraging 3D scene graphs and vision-language models (VLMs) across a diverse range of real-world 3D scenes. MSQA includes 251K situated question-answering pairs across 9 distinct question categories, covering complex scenarios and object modalities within 3D scenes. Our benchmark introduces a novel interleaved multi-modal input setting that combines text, images, and point clouds to describe situations and questions, resolving the ambiguity of single-modality descriptions (e.g., text alone).
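To make the interleaved input setting concrete, the sketch below shows one plausible layout for a single MSQA sample. The field names and values are illustrative assumptions for exposition, not the released data schema.

```python
# Hypothetical MSQA sample layout (field names are assumptions, not the
# official schema). A situation is described by interleaved text and
# image segments, grounded in a 3D point cloud of the scene.
sample = {
    "scene_id": "scene0000_00",            # source 3D scan identifier
    "point_cloud": "scene0000_00.npy",     # per-point XYZ + RGB features
    "situation": [
        {"type": "text",  "value": "You are standing next to"},
        {"type": "image", "value": "obj_042.png"},  # image crop of the object
        {"type": "text",  "value": ", facing the window."},
    ],
    "location": [1.2, 0.4, 0.0],           # agent position in scene coordinates
    "orientation": [0.0, 0.0, 0.7, 0.7],   # agent heading as a quaternion
    "question": "What is on the table to your left?",
    "answer": "a laptop",
    "category": "spatial relationship",    # one of the 9 question categories
}
```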

Additionally, we devise the Multi-modal Situated Next-step Navigation (MSNN) benchmark to evaluate models' grounding of actions and transitions between situations. Comprehensive evaluations on the reasoning and navigation tasks highlight the limitations of existing vision-language models and underscore the importance of handling interleaved multi-modal inputs and of explicit situation modeling. Experiments on data scaling and cross-domain transfer further demonstrate that MSQA is an effective pre-training dataset for developing more powerful situated reasoning models, contributing to advancements in 3D scene understanding for embodied AI.
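As a rough illustration of how a next-step navigation benchmark of this kind can be scored, the sketch below computes exact-match accuracy over predicted next-step actions. The function and field names here are our own assumptions, not the benchmark's API.

```python
# Minimal scoring sketch in the spirit of MSNN (names are assumptions,
# not released code): the model sees the scene, the agent's situation,
# and a navigation instruction, then predicts a single next-step action.
def evaluate_msnn(model, samples):
    """Exact-match accuracy against the ground-truth next action."""
    correct = 0
    for s in samples:
        pred = model.predict_action(
            point_cloud=s["point_cloud"],
            situation=s["situation"],
            instruction=s["instruction"],
        )
        # e.g., pred is "turn left" or "move forward"
        correct += int(pred == s["gt_action"])
    return correct / len(samples)
```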

Benchmarks

An overview of benchmarking tasks in MSR3D. We use green boxes for objects mentioned in situation descriptions, red for objects in questions, and purple for objects in navigation instructions.


Data Distribution

Question distribution of MSQA.


MSR3D

MSR3D accepts multi-modal inputs: the scene's 3D point cloud, an interleaved text-image situation description, the agent's location and orientation, and the question.
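The sketch below illustrates one way such an interface could be wired up, assuming per-modality encoders whose token sequences are fused by an autoregressive language backbone. All module and parameter names are assumptions for illustration, not the released MSR3D implementation.

```python
import torch
import torch.nn as nn

class SituatedReasoner(nn.Module):
    """Illustrative stand-in for an MSR3D-style model (architecture
    details are assumptions): each modality is encoded into token
    embeddings of shape (batch, n, hidden), then concatenated and
    passed to a language backbone that produces the answer."""

    def __init__(self, pc_encoder, img_encoder, text_encoder, llm, hidden=4096):
        super().__init__()
        self.pc_encoder = pc_encoder      # point cloud -> object tokens
        self.img_encoder = img_encoder    # image crops -> visual tokens
        self.text_encoder = text_encoder  # text spans -> text tokens
        self.pose_proj = nn.Linear(7, hidden)  # location (3) + quaternion (4)
        self.llm = llm                    # autoregressive decoder

    def forward(self, point_cloud, situation, location, orientation, question):
        tokens = [self.pc_encoder(point_cloud)]
        # Preserve the interleaving order of the situation description.
        for seg in situation:
            if seg["type"] == "text":
                tokens.append(self.text_encoder(seg["value"]))
            else:
                tokens.append(self.img_encoder(seg["value"]))
        pose = torch.cat([location, orientation], dim=-1)   # (batch, 7)
        tokens.append(self.pose_proj(pose).unsqueeze(1))    # (batch, 1, hidden)
        tokens.append(self.text_encoder(question))
        return self.llm(torch.cat(tokens, dim=1))           # answer logits
```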


Data Collection Pipeline

An overview of our data collection pipeline: situated scene graph generation, situated QA pair generation, and post-processing.
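The sketch below outlines these three stages in code form. Every helper here (sample_agent_pose, build_situated_scene_graph, passes_quality_checks, llm.generate_qa) is a hypothetical placeholder for a component the pipeline description does not detail.

```python
# High-level sketch of the collection pipeline described above. The
# stage functions are hypothetical placeholders, not released code.
def collect_msqa(scene, llm):
    # 1. Situated scene graph: sample an agent pose and recompute
    #    spatial relations (e.g., left/right, front/back) relative to it.
    pose = sample_agent_pose(scene)
    scene_graph = build_situated_scene_graph(scene, pose)

    # 2. QA generation: prompt an LLM/VLM with the situated scene graph
    #    to produce situation descriptions and question-answer pairs.
    qa_pairs = llm.generate_qa(scene_graph, pose)

    # 3. Post-processing: filter unanswerable or ambiguous pairs and
    #    balance the question categories.
    return [qa for qa in qa_pairs if passes_quality_checks(qa, scene_graph)]
```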


Data Sample


Note: Use the dropdown menus to select a scene and a corresponding situation within it. Drag to move your view around.