Published September 22, 2025 | Version v1
Software Open

A Hierarchical Framework for Vision-Language Model-Driven Robotic Control in Simulated Environments

Authors/Creators

Description

This work presents a practical proof of concept for a hierarchical robot control system that bridges the gap between high-level natural language instructions and low-level motor actions. The architecture uses a large Vision-Language Model (VLM), specifically LLaVA-1.5 7B, as a high-level "director." The VLM takes visual input from a simulated PyBullet environment together with a user's textual command and selects an appropriate action from a predefined Skill Library. These skills are simple, pre-programmed motor behaviors (e.g., 'wave_hand', 'point_forward') that a low-level controller then executes. Our experiments demonstrate that the VLM can successfully perceive a multi-object scene, interpret ambiguous natural language commands, and correctly choose the corresponding skill to fulfill the user's intent. The project, available in its GitHub repository (https://github.com/zorino96/VLM-Robot-Director), demonstrates how large pre-trained models can be integrated into robotic systems, using only open-source tools, to enable more intuitive and flexible human-robot interaction.
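
The director-and-skills split described above can be illustrated with a minimal sketch. This is not the repository's code: the function names (`build_prompt`, `parse_skill`, `direct`), the prompt wording, and the stub `fake_vlm` are all hypothetical, and the skill bodies are placeholders standing in for PyBullet motor commands. Only the skill names 'wave_hand' and 'point_forward' come from the description. In a real setup the `vlm` callable would presumably wrap LLaVA-1.5 7B and receive a camera frame rendered from the simulation.

```python
from typing import Any, Callable, Dict, List, Optional

# Placeholder skills: a real low-level controller would drive robot joints here.
def wave_hand() -> str:
    return "executed wave_hand"

def point_forward() -> str:
    return "executed point_forward"

# Hypothetical skill library mapping skill names to callables.
SKILL_LIBRARY: Dict[str, Callable[[], str]] = {
    "wave_hand": wave_hand,
    "point_forward": point_forward,
}

def build_prompt(command: str, skills: List[str]) -> str:
    """Ask the model to answer with one skill name (assumed prompt wording)."""
    options = ", ".join(skills)
    return (
        "You control a robot. Look at the image and the user's command.\n"
        f"Command: {command}\n"
        f"Reply with exactly one skill name from: {options}"
    )

def parse_skill(response: str, skills: List[str]) -> Optional[str]:
    """Return the skill name that appears earliest in the response, or None."""
    text = response.lower()
    hits = [(text.find(name), name) for name in skills if name in text]
    return min(hits)[1] if hits else None

def direct(
    command: str,
    image: Any,
    vlm: Callable[[Any, str], str],
    library: Dict[str, Callable[[], str]] = SKILL_LIBRARY,
) -> str:
    """High-level step: query the VLM, then dispatch the chosen skill."""
    names = list(library)
    chosen = parse_skill(vlm(image, build_prompt(command, names)), names)
    if chosen is None:
        return "no skill selected"
    return library[chosen]()

def fake_vlm(image: Any, prompt: str) -> str:
    """Stub standing in for the real model so the sketch runs offline."""
    return "I would choose wave_hand to greet the person."

if __name__ == "__main__":
    print(direct("Say hello to the visitor", None, fake_vlm))
```

Restricting the model's answer to a fixed set of skill names keeps the low-level controller safe from free-form output: anything that does not match a known skill is rejected rather than executed.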

Files

VLM_Abstract_Reasoning_in_Robotics.ipynb

Files (8.7 kB)

md5:b5454e94b11499c7de0bb1f94964aeed

Additional details

Software

Repository URL
https://github.com/zorino96/VLM-Robot-Director
Programming language
Python