Hybrid Vision-and-Language Fusion: A Threefold Learning Approach for elevating Image Captioning through Adaptive Strategies

Bhandari, Sravya; Kumar, Abhishek; Batta, Priya; Shambhu, Shankar

doi:10.5281/zenodo.18373532

Published December 1, 2025 | Version v1

Journal article Open


1. Masters in Artificial Intelligence & Machine Learning, Liverpool John Moores University, India
2. Dept. of CSE, Chandigarh University, India
3. Amity School of Engineering and Technology, Amity University Punjab, Mohali, India
4. Chitkara University School of Engineering & Technology, Chitkara University, India

Contributors

Contact person:

Batta, Priya¹

1. Dept. of CSE, Chandigarh University, India; Amity School of Engineering and Technology, Amity University Punjab, Mohali, India

Image captioning is a significant application area for artificial intelligence. A machine that can interpret an image much as a human does demonstrates a higher level of intelligence and image comprehension. This research presents advancements in real-time image collection and labeling systems using a triad of computer vision, natural language processing, and classification. The approach employs three deep learning models to generate human-level natural language descriptions, resulting in a user-friendly system. The model comprises a multimodal pipeline of deep learning architectures that extracts probabilistic features for each object category. Our model surpasses other image captioning models, achieving a CIDEr score of 37.93% on the common MS-COCO captioning test benchmark, and exhibits superior syntactical saliency when integrated with advanced object features. Additionally, we observed that incorporating an intermediate step of clustering objects before classification enhances the final model's performance. With these methods, we have developed a more capable and accurate model, proficient in both object classification and the generation of informative image descriptions. Such capabilities can significantly augment human comprehension and decision-making across various applications, particularly in advancing sustainable cities and communities, fostering quality education through improved accessibility of visual content, and promoting industry, innovation, and infrastructure with cutting-edge AI technologies.
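The abstract reports that clustering objects before classification improves the final model. The record does not say which clustering or classification methods the authors use, so the following is only a minimal sketch of the general idea, assuming k-means for the clustering step and a nearest-class-mean rule inside each cluster. The toy 2-D vectors stand in for learned object features, and every name here (`nearest`, `kmeans`, `fit_cluster_then_classify`, `predict`, `FEATURES`, `LABELS`) is hypothetical.

```python
import math
import random


def nearest(x, centroids, candidates=None):
    """Index of the centroid closest to x (Euclidean distance)."""
    ids = range(len(centroids)) if candidates is None else candidates
    return min(ids, key=lambda c: math.dist(x, centroids[c]))


def kmeans(points, k, iters=20, seed=0):
    """Plain k-means; an assumed stand-in for the paper's clustering step."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        assign = [nearest(p, centroids) for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def fit_cluster_then_classify(features, labels, k, seed=0):
    """Cluster first, then store one mean vector per class inside each cluster."""
    centroids = kmeans(features, k, seed=seed)
    grouped = {}  # cluster id -> {label: [feature vectors]}
    for x, y in zip(features, labels):
        grouped.setdefault(nearest(x, centroids), {}).setdefault(y, []).append(x)
    class_means = {
        c: {y: [sum(col) / len(xs) for col in zip(*xs)] for y, xs in by_label.items()}
        for c, by_label in grouped.items()
    }
    return centroids, class_means


def predict(x, centroids, class_means):
    """Route x to its nearest non-empty cluster, then pick the nearest class mean."""
    c = nearest(x, centroids, candidates=class_means)
    means = class_means[c]
    return min(means, key=lambda y: math.dist(x, means[y]))


# Toy data (made up for illustration): two coarse groups, two classes in each.
FEATURES = [[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.2, 0.9],
            [10.0, 10.0], [10.2, 10.1], [11.0, 11.0], [11.2, 10.9]]
LABELS = ["cat", "cat", "dog", "dog", "car", "car", "bus", "bus"]

centroids, class_means = fit_cluster_then_classify(FEATURES, LABELS, k=2)
print(predict([0.1, 0.0], centroids, class_means))
print(predict([10.9, 11.0], centroids, class_means))
```

In this sketch the clustering step narrows each decision to the classes that actually occur in one region of feature space, which is one plausible reading of why an intermediate grouping stage could help a downstream classifier.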

Notes

Published in Evergreen, Volume 12, Issue 04. Citation formats available via DOI link.

Files

p1840-1866.pdf (3.2 MB)
md5:fec38bef1a938dd2f40ee8f30b055584

Additional details

Is identical to: Journal article: 10.5109/7402620 (DOI)
Is supplemented by: Other: https://citation.crossref.org/?doi=10.5109/7402620 (URL)

