A Hierarchical Framework for Vision-Language Model-Driven Robotic Control in Simulated Environments

Mahdi, Zrng

doi:10.5281/zenodo.17172536

Published September 22, 2025 | Version v1

Software | Open access

This work presents a practical proof-of-concept for a hierarchical control system for robots, bridging the gap between high-level natural language instructions and low-level motor actions. The architecture leverages a large Vision-Language Model (VLM), specifically LLaVA-1.5 7B, to act as a high-level "director." The VLM processes visual input from a simulated PyBullet environment and a user's textual command to select an appropriate action from a predefined Skill Library. These skills, representing simple, pre-programmed motor behaviors (e.g., 'wave_hand', 'point_forward'), are then executed by a low-level controller. Our experiments demonstrate that the VLM can successfully perceive a multi-object scene, interpret ambiguous natural language commands, and correctly choose the corresponding skill to fulfill the user's intent. This project, accessible via its GitHub repository (https://github.com/zorino96/VLM-Robot-Director), serves as a clear demonstration of how large pre-trained models can be integrated into robotic systems to enable more intuitive and flexible human-robot interaction, using only open-source tools.
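The director/skill split described above can be illustrated with a minimal sketch. It is not the code from the notebook or the repository: the function names (`build_prompt`, `parse_skill`, `direct`), the prompt wording, and the stubbed `stub_vlm` are all assumptions made for illustration. Only the skill names `wave_hand` and `point_forward` come from the description. In a real setup the stub would be replaced by a call to LLaVA-1.5 7B with a rendered PyBullet camera image, and each skill would drive the simulated robot's joints rather than append to a log.

```python
from typing import Any, Callable, Dict, List, Optional

# Hypothetical skills: stand-ins for pre-programmed motor behaviors.
# Here they only record what a low-level controller would execute.
def wave_hand(log: List[str]) -> None:
    log.append("wave_hand: oscillate wrist joint")

def point_forward(log: List[str]) -> None:
    log.append("point_forward: extend arm forward")

# The Skill Library maps a skill name to its low-level behavior.
SKILL_LIBRARY: Dict[str, Callable[[List[str]], None]] = {
    "wave_hand": wave_hand,
    "point_forward": point_forward,
}

def build_prompt(command: str, skills: List[str]) -> str:
    """Constrain the model to answer with one name from the library."""
    options = ", ".join(skills)
    return (
        f"Command: '{command}'. "
        f"Choose exactly one skill from [{options}] and reply with its name only."
    )

def parse_skill(reply: str, skills: List[str]) -> Optional[str]:
    """Return the skill named in the reply, or None if zero or several match."""
    text = reply.lower()
    matches = [name for name in skills if name in text]
    return matches[0] if len(matches) == 1 else None

def direct(command: str, image: Any,
           vlm: Callable[[Any, str], str], log: List[str]) -> Optional[str]:
    """High-level step: ask the VLM for a skill, then run it if the reply is valid."""
    skills = sorted(SKILL_LIBRARY)
    reply = vlm(image, build_prompt(command, skills))
    skill = parse_skill(reply, skills)
    if skill is None:
        # Refuse to act on ambiguous or unknown replies.
        log.append("no_op: reply did not name a single known skill")
        return None
    SKILL_LIBRARY[skill](log)
    return skill

def stub_vlm(image: Any, prompt: str) -> str:
    """Assumed stand-in for the VLM: a keyword rule, ignoring the image."""
    text = prompt.lower()
    if "hello" in text:
        return "I would choose wave_hand."
    if "show me" in text:
        return "point_forward"
    return "I am not sure."

if __name__ == "__main__":
    actions: List[str] = []
    print(direct("Say hello to the red cube", None, stub_vlm, actions))
    print(actions)
```

Restricting the model's output to a fixed set of names and rejecting replies that match none or several of them keeps the low-level controller from acting on free-form text, which is one plausible reason to route commands through a predefined library.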

Files

Files (8.7 kB)

Name	Size	MD5
VLM_Abstract_Reasoning_in_Robotics.ipynb	8.7 kB	b5454e94b11499c7de0bb1f94964aeed

Additional details

Repository URL: https://github.com/zorino96/VLM-Robot-Director
Programming language: Python

	All versions	This version
Views	19	19
Downloads	5	5
Data volume	43.6 kB	43.6 kB

DOI: 10.5281/zenodo.17172536
Resource type: Software
Publisher: Zenodo

License: Creative Commons Attribution 4.0 International

The Creative Commons Attribution license allows redistribution and reuse of a licensed work on the condition that the creator is appropriately credited.

License: MIT License

A short and simple permissive license with conditions only requiring preservation of copyright and license notices. Licensed works, modifications, and larger works may be distributed under different terms and without source code.

Technical metadata

Created: September 22, 2025
Modified: September 22, 2025
