Generasi Caption dan Ekstraksi Keyword untuk Gambar dengan Fine-Tuning Vision Transformer dan GPT-2 serta Fitur Konversi Audio untuk Peningkatan Aksesibilitas
Caption Generation and Keyword Extraction for Images Using Fine-Tuned Vision Transformer and GPT-2 With Audio Conversion Feature for Enhanced Accessibility

Date
2025
Author
Ginting, Elin Betsey
Advisor(s)
Amalia
Hardi, Sri Melvani
Abstract
The increasing volume and variety of images, often lacking accompanying information or descriptions, pose a new challenge in image processing. This research develops an image captioning system using Transformer models, integrated into a web-based platform. The system also provides keyword extraction from the generated captions and audio conversion of the captions to enhance accessibility. The training dataset was developed independently by collecting various types of images from the internet and manually annotating each with a caption; in total, it consists of 28,009 images covering categories such as photographs, paintings, cartoons, illustrations, carvings, and sculptures. The methodology applies a Vision Transformer (ViT) model as the encoder to extract features from the images and GPT-2 as the decoder to generate text-based captions. The model was trained using a supervised fine-tuning approach with a batch size of 4, 10 epochs, and gradient clipping to maintain training stability. Its performance was evaluated using ROUGE metrics to assess the similarity between generated captions and reference captions. The trained model achieved a training loss of 0.0187 and a validation loss of 0.147831. Evaluation results show that the model obtained a ROUGE-1 score of 60.6806, ROUGE-2 of 46.842, ROUGE-L of 52.5587, and ROUGE-Lsum of 52.0473. The developed model was then uploaded to the Hugging Face Hub and integrated into a website via the Hugging Face API; the website additionally uses the Web Speech API to convert caption text into audio. These findings demonstrate that integrating such models for the image captioning task has great potential to improve access to visual information through digital platforms.
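As a concrete illustration of the encoder-decoder pairing the abstract describes, the sketch below wires a ViT encoder to a GPT-2 decoder with the Hugging Face transformers library and applies gradient clipping during supervised fine-tuning. This is a minimal sketch under stated assumptions: the checkpoint names, learning rate, and clipping norm are illustrative choices, and the thesis's own 28,009-image dataset and training script are not part of this record.

```python
# Minimal sketch of a ViT encoder + GPT-2 decoder captioning setup.
# Checkpoint names, lr, and max_norm are assumptions for illustration only.
import torch
from torch.utils.data import DataLoader
from transformers import (
    VisionEncoderDecoderModel,
    ViTImageProcessor,
    GPT2TokenizerFast,
)

# Pair a pretrained ViT encoder with a pretrained GPT-2 decoder,
# mirroring the architecture described in the abstract.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k",  # assumed encoder checkpoint
    "gpt2",                               # assumed decoder checkpoint
)
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.pad_token_id

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

def train_epoch(loader: DataLoader) -> None:
    """One epoch over batches of {'pixel_values', 'labels'} dicts (batch size 4)."""
    model.train()
    for batch in loader:
        outputs = model(pixel_values=batch["pixel_values"],
                        labels=batch["labels"])
        outputs.loss.backward()
        # Gradient clipping for training stability, as the abstract describes.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        optimizer.zero_grad()

# Usage sketch for caption generation after fine-tuning:
#   pixel_values = processor(images=img, return_tensors="pt").pixel_values
#   ids = model.generate(pixel_values, max_length=32)
#   caption = tokenizer.decode(ids[0], skip_special_tokens=True)
```

Here VisionEncoderDecoderModel handles the cross-attention between the ViT image features and the GPT-2 decoder internally, which is why the training loop only needs the paired forward call with labels.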
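The ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-Lsum figures reported above correspond to the output keys of the rouge metric in the Hugging Face evaluate library; the sketch below shows that evaluation step, with placeholder captions standing in for the thesis's generated and reference texts.

```python
# Hedged sketch of the ROUGE evaluation step; the captions are placeholders.
import evaluate

rouge = evaluate.load("rouge")
predictions = ["a wooden carving of an elephant"]  # generated captions (illustrative)
references = ["a carved wooden elephant statue"]   # reference captions (illustrative)

scores = rouge.compute(predictions=predictions, references=references)
# Keys: rouge1, rouge2, rougeL, rougeLsum — the metrics reported in the abstract,
# scaled to 0-100 to match the reported score range.
print({k: round(v * 100, 4) for k, v in scores.items()})
```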
Collections
- Undergraduate Theses [1235]