Published June 1, 2025 | Version v1
Journal article | Open Access

Transformer+transformer architecture for image captioning in Indonesian language

  • Calvin Institute of Technology

Description

Image captioning in the Indonesian language poses a significant challenge due to the complex interplay between visual and linguistic comprehension, as well as the scarcity of publicly available datasets. Despite considerable advancements in this field, research specifically targeting the Indonesian language remains scarce. In this paper, we propose a novel image captioning model employing a transformer-based architecture for both the encoder and decoder components. Our model is trained and evaluated on the pre-translated Flickr30k dataset in the Indonesian language. We conduct a comparative analysis of various transformer configurations and convolutional neural network (CNN)-recurrent neural network (RNN) architectures. Our findings highlight the superior performance of a vision transformer (ViT) as the visual encoder, combined with IndoBERT as the textual decoder. This architecture achieved a BLEU-4 score of 0.223 and a ROUGE-L score of 0.472.
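The encoder-decoder flow described in the abstract (ViT patch embeddings feeding a text decoder via cross-attention) can be sketched minimally as follows. This is an illustrative NumPy sketch with hypothetical dimensions (16x16 patches, 64-dimensional embeddings, random projections); the paper's exact ViT and IndoBERT configurations are not specified here.

```python
import numpy as np

def vit_patch_embed(image, patch=16, d_model=64, rng=None):
    """Split an image into non-overlapping patches and linearly project
    each patch to a d_model-dimensional embedding (the ViT encoder input).
    A real ViT would add positional embeddings and transformer layers."""
    rng = rng or np.random.default_rng(0)
    H, W, C = image.shape
    n_patches = (H // patch) * (W // patch)
    patches = image.reshape(H // patch, patch, W // patch, patch, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(n_patches, -1)
    W_proj = rng.standard_normal((patches.shape[1], d_model)) * 0.02
    return patches @ W_proj  # shape: (n_patches, d_model)

def cross_attention(decoder_states, encoder_states):
    """Decoder token states attend over encoder patch embeddings --
    the mechanism that grounds each generated caption word in the
    visual features produced by the encoder."""
    d = decoder_states.shape[-1]
    scores = decoder_states @ encoder_states.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ encoder_states  # one visual context per token

# A 224x224 RGB image yields 14*14 = 196 patch embeddings.
image = np.random.default_rng(1).random((224, 224, 3))
enc = vit_patch_embed(image)
# Five hypothetical decoder token states (e.g. a partial caption).
dec = np.random.default_rng(2).standard_normal((5, 64))
ctx = cross_attention(dec, enc)
print(enc.shape, ctx.shape)  # (196, 64) (5, 64)
```

In the paper's actual model the decoder is IndoBERT, so the token states come from pretrained Indonesian-language embeddings and transformer layers rather than random vectors; the sketch only shows how visual and textual representations are joined.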
