A Hierarchical Framework for Vision-Language Model-Driven Robotic Control in Simulated Environments
Authors/Creators
Description
This work presents a proof-of-concept hierarchical control system for robots that bridges the gap between high-level natural language instructions and low-level motor actions. The architecture uses a large Vision-Language Model (VLM), specifically LLaVA-1.5 7B, as a high-level "director." The VLM receives visual input from a simulated PyBullet environment together with a user's textual command and selects an appropriate action from a predefined Skill Library. Each skill is a simple, pre-programmed motor behavior (e.g., 'wave_hand', 'point_forward') that a low-level controller then executes. Our experiments demonstrate that the VLM can perceive a multi-object scene, interpret ambiguous natural language commands, and choose the corresponding skill to fulfill the user's intent. The project, available in its GitHub repository (https://github.com/zorino96/VLM-Robot-Director), demonstrates how large pre-trained models can be integrated into robotic systems using only open-source tools to enable more intuitive and flexible human-robot interaction.
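The director-and-skill-library pattern described above can be illustrated with a minimal sketch. It is not taken from the repository: the function names (`build_prompt`, `parse_skill`, `direct`), the prompt wording, and the placeholder skill bodies are all assumptions made for illustration, and the VLM is replaced by a stub so the example runs without LLaVA or PyBullet. In a real setup, the `vlm` callable would presumably wrap LLaVA-1.5 7B and `image` would be a camera frame rendered from the simulation.

```python
from typing import Any, Callable, Dict, Optional

# Placeholder skills: a real low-level controller would drive robot joints.
def wave_hand() -> str:
    return "executing wave_hand"

def point_forward() -> str:
    return "executing point_forward"

# Hypothetical Skill Library mapping skill names to motor behaviors.
SKILL_LIBRARY: Dict[str, Callable[[], str]] = {
    "wave_hand": wave_hand,
    "point_forward": point_forward,
}

def build_prompt(command: str) -> str:
    """Ask the VLM to answer with exactly one skill name (assumed wording)."""
    skills = ", ".join(sorted(SKILL_LIBRARY))
    return (
        f"You control a robot. The user says: '{command}'. "
        f"Look at the image and answer with exactly one skill from: {skills}."
    )

def parse_skill(vlm_reply: str) -> Optional[str]:
    """Return the single skill named in the reply, or None if the reply
    names no known skill or more than one."""
    reply = vlm_reply.lower()
    matches = [name for name in SKILL_LIBRARY if name in reply]
    return matches[0] if len(matches) == 1 else None

def direct(command: str, image: Any, vlm: Callable[[str, Any], str]) -> str:
    """High-level director: query the VLM, then dispatch the chosen skill."""
    skill = parse_skill(vlm(build_prompt(command), image))
    if skill is None:
        return "no skill selected"
    return SKILL_LIBRARY[skill]()

# Stub standing in for the real model so the sketch is self-contained.
def fake_vlm(prompt: str, image: Any) -> str:
    return "The best choice is wave_hand."

result = direct("say hello to the person", None, fake_vlm)
```

Restricting the model's answer to names in a fixed library and rejecting ambiguous replies means free-form model text never reaches the controller directly; only a validated skill name does.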
Files
| Name | Size |
|---|---|
| VLM_Abstract_Reasoning_in_Robotics.ipynb (md5:b5454e94b11499c7de0bb1f94964aeed) | 8.7 kB |
Additional details
Software
- Repository URL: https://github.com/zorino96/VLM-Robot-Director
- Programming language: Python