Towards Social Foundation Models: A Framework and Synthetic Dataset for Grounding Visual Perspective Taking in Robots
Authors/Creators
Description
The next frontier in robotics is the creation of truly collaborative agents that share our physical space. Achieving this requires robots to develop foundational socio-cognitive abilities. Chief among these is the ability to establish shared spatial representations between interacting agents. While modern Vision-Language Models possess powerful semantic capabilities, their understanding is not grounded in metric space, preventing them from mastering core skills such as Visual Perspective Taking (VPT), the ability to understand the world from another agent's viewpoint. We argue this is not an architectural limitation but a data problem, leading us to propose Social Foundation Models: a new class of models designed to master a curriculum of foundational socio-cognitive primitives. This paper presents a blueprint for this vision. We reformulate VPT as a 6-DOF pose regression task and introduce SynthVPT, a large-scale synthetic dataset of procedurally generated RGB images with precise ground-truth annotations. We then present a conceptual framework that exploits the rich prior knowledge of a pre-trained Vision-Language Model, fine-tuning it on our data to create a general-purpose spatial reasoner. This methodology provides a tangible and scalable pathway for teaching embodied agents the foundational socio-cognitive skills needed for genuine human-robot collaboration, moving beyond simple instruction-following towards a truly shared reality.
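As an illustrative aside (not part of the record itself), the sketch below shows one plausible way to frame VPT as 6-DOF pose regression on top of features from a VLM image encoder. The layer sizes, the quaternion pose parameterization, the loss weighting, and the sample format are assumptions for illustration, not the paper's or the dataset's actual design.

```python
# Minimal sketch, assuming a hypothetical sample format: an RGB image from the
# observer's viewpoint paired with the other agent's ground-truth 6-DOF pose
# (3-D translation in metres + orientation as a unit quaternion).
import torch
import torch.nn as nn


class PoseRegressionHead(nn.Module):
    """Maps a visual feature vector to a 6-DOF pose: translation + unit quaternion."""

    def __init__(self, feat_dim: int = 768):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, 7))

    def forward(self, feats: torch.Tensor):
        out = self.mlp(feats)
        trans = out[:, :3]                                  # (x, y, z)
        quat = nn.functional.normalize(out[:, 3:], dim=-1)  # (w, x, y, z), unit norm
        return trans, quat


def pose_loss(pred_t, pred_q, gt_t, gt_q, beta: float = 1.0):
    """L2 translation error plus a sign-invariant quaternion distance term."""
    t_err = nn.functional.mse_loss(pred_t, gt_t)
    q_err = (1.0 - torch.abs((pred_q * gt_q).sum(dim=-1))).mean()
    return t_err + beta * q_err


# Hypothetical usage with embeddings from a (frozen or fine-tuned) VLM image encoder:
feats = torch.randn(4, 768)  # stand-in for VLM visual features
gt_t = torch.randn(4, 3)
gt_q = nn.functional.normalize(torch.randn(4, 4), dim=-1)
head = PoseRegressionHead()
pred_t, pred_q = head(feats)
loss = pose_loss(pred_t, pred_q, gt_t, gt_q)
```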
Files

| Name | Size | MD5 |
|---|---|---|
| Towards Social Foundation Models.pdf | 1.3 MB | 077bb9ba5e81b9e8ff19e51873be2cc5 |