Published January 1, 2025 | Version v1
Conference paper Open

Breaking the linear barrier: A multi-modal LLM-based system for navigating complex web content

Description

Visually impaired users still face fundamental obstacles when interacting with complex, dynamic websites. Conventional screen readers expose pages in a strict linear order, offer little semantic context for visual media, and provide limited context regarding the page content. This paper introduces a multi-modal accessibility framework combining Large Language Models (LLMs), Computer Vision, and dynamic DOM manipulation to significantly enhance semantic clarity, non-linear navigation, and interaction richness. By interpreting visual and textual web content contextually and adapting it into an intuitive, conversationally navigable interface, our method provides a foundation for visually impaired users to interact effectively with previously inaccessible or challenging digital experiences.The deployment of a functional prototype on a modern web browser illustrates the capability of the proposed system to interact with diverse websites and tasks. The research team selected Canada's most frequented websites to assess the system's efficacy in enhancing contextual understanding of the page content and enabling navigation through pages and actions via a chat-driven interface. A comprehensive demonstration was executed using a prominent ticketing site, which facilitated users in obtaining a deeper understanding of the page while guiding them towards the successful purchase of concert tickets. By illustrating how vision language and LLM reasoning can be coupled with low-level browser control, this work lays the groundwork for future efforts in performance optimization, large-scale evaluation, and personalization across diverse web contexts.

Files

moterani-2025-breaking.pdf

Files (7.2 MB)

Name Size Download all
md5:35480988e042e631b38660b12883404d
7.2 MB Preview Download

Additional details