
dc.contributor.advisor: Amalia
dc.contributor.advisor: Hardi, Sri Melvani
dc.contributor.author: Ginting, Elin Betsey
dc.date.accessioned: 2025-07-02T04:07:04Z
dc.date.available: 2025-07-02T04:07:04Z
dc.date.issued: 2025
dc.identifier.uri: https://repositori.usu.ac.id/handle/123456789/104776
dc.description.abstract: The increasing volume and variety of images, often lacking accompanying information or descriptions, pose a new challenge in image processing. This research develops an image captioning system using Transformer models, integrated into a web-based platform. The system is also equipped with keyword extraction from the generated captions and audio conversion of those captions to enhance accessibility. The training dataset was developed independently by collecting various types of images from the internet and manually annotating each with a caption. In total, the dataset consists of 28,009 images covering categories such as photographs, paintings, cartoons, illustrations, carvings, and sculptures. The methodology applies a Vision Transformer (ViT) model as the encoder to extract features from the images and GPT-2 as the decoder to generate text-based captions. The model was trained with a supervised fine-tuning approach using a batch size of 4, 10 epochs, and gradient clipping to maintain training stability. Its performance was evaluated using ROUGE metrics to assess the similarity between generated captions and reference captions. The trained model achieved a training loss of 0.0187 and a validation loss of 0.147831. Evaluation results show that the model obtained a ROUGE-1 score of 60.6806, ROUGE-2 of 46.842, ROUGE-L of 52.5587, and ROUGE-Lsum of 52.0473. The developed model was then uploaded to the Hugging Face Hub and integrated into a website via the Hugging Face API; the website additionally uses the Web Speech API to convert caption text into audio. These findings demonstrate that integrating such models for the image captioning task has great potential to improve the accessibility of visual information through digital platforms. [en_US] (Illustrative Python sketches of these steps follow the record below.)
dc.language.iso: id [en_US]
dc.publisher: Universitas Sumatera Utara [en_US]
dc.subject: Image Captioning [en_US]
dc.subject: Vision Transformer (ViT) [en_US]
dc.subject: GPT-2 [en_US]
dc.subject: Keyword Extraction [en_US]
dc.subject: Text-to-Speech Conversion [en_US]
dc.subject: Hugging Face API [en_US]
dc.subject: Web Speech API [en_US]
dc.title: Generasi Caption dan Ekstraksi Keyword untuk Gambar dengan Fine-Tuning Vision Transformer dan GPT-2 serta Fitur Konversi Audio untuk Peningkatan Aksesibilitas [en_US]
dc.title.alternative: Caption Generation and Keyword Extraction for Images Using Fine-Tuned Vision Transformer and GPT-2 with Audio Conversion Feature for Enhanced Accessibility [en_US]
dc.type: Thesis [en_US]
dc.identifier.nim: NIM211401050
dc.identifier.nidn: NIDN0121127801
dc.identifier.kodeprodi: KODEPRODI55201#Ilmu Komputer
dc.description.pages: 78 Pages [en_US]
dc.description.type: Skripsi Sarjana (Undergraduate Thesis) [en_US]
dc.subject.sdgs: SDGs 4. Quality Education [en_US]
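
The abstract pairs a ViT encoder with a GPT-2 decoder for caption generation. Below is a minimal sketch of how such a pairing is typically loaded and run with the Hugging Face transformers VisionEncoderDecoderModel; the model id, image path, and decoding parameters are illustrative assumptions, not details taken from the thesis.

```python
from PIL import Image
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, GPT2TokenizerFast

MODEL_ID = "username/vit-gpt2-captioning"  # hypothetical Hub id, not from the thesis

model = VisionEncoderDecoderModel.from_pretrained(MODEL_ID)
processor = ViTImageProcessor.from_pretrained(MODEL_ID)
tokenizer = GPT2TokenizerFast.from_pretrained(MODEL_ID)

# The ViT encoder consumes fixed-size image patches; the processor handles
# resizing and normalization.
image = Image.open("example.jpg").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values

# The GPT-2 decoder generates the caption token by token from the image features.
output_ids = model.generate(pixel_values, max_length=64, num_beams=4)  # assumed decoding settings
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```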
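The abstract states the model was fine-tuned with a batch size of 4 for 10 epochs with gradient clipping. A minimal PyTorch loop matching those hyperparameters might look as follows; the learning rate, the clipping norm, and `train_dataset` are assumptions not given in the abstract.

```python
import torch
from torch.utils.data import DataLoader

# `model` is the VisionEncoderDecoderModel from the previous sketch;
# `train_dataset` is assumed to yield dicts with fixed-size "pixel_values"
# tensors and padded "labels" (caption token ids) so the default collate works.
loader = DataLoader(train_dataset, batch_size=4, shuffle=True)  # batch size 4 per the abstract
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)  # assumed learning rate

model.train()
for epoch in range(10):  # 10 epochs per the abstract
    for batch in loader:
        # VisionEncoderDecoderModel returns a cross-entropy loss when `labels`
        # are supplied alongside the pixel values.
        loss = model(pixel_values=batch["pixel_values"], labels=batch["labels"]).loss
        loss.backward()
        # Gradient clipping, as described in the abstract, to keep training
        # stable; the max-norm value is an assumption.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        optimizer.zero_grad()
```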
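Evaluation uses ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-Lsum. The Hugging Face `evaluate` library reports exactly those four variants; a minimal sketch follows, with placeholder captions rather than thesis data.

```python
import evaluate  # pip install evaluate rouge_score

rouge = evaluate.load("rouge")

# Placeholder captions, not thesis data.
predictions = ["a brown dog runs across the grass"]    # generated captions
references = ["a brown dog is running on the grass"]   # reference captions

# Returns rouge1, rouge2, rougeL, and rougeLsum as fractions in [0, 1];
# the abstract reports these scores scaled by 100.
print(rouge.compute(predictions=predictions, references=references))
```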
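The trained checkpoint was uploaded to the Hugging Face Hub and called from the website via the Hugging Face API. A minimal sketch of such a call over HTTP using the serverless Inference API is shown below; the model id and token are placeholders. The final text-to-speech step runs client-side in the browser through the Web Speech API (speechSynthesis), so it has no Python counterpart here.

```python
import requests

# Placeholder model id and token; the real site would call the thesis's own
# fine-tuned checkpoint on the Hub.
API_URL = "https://api-inference.huggingface.co/models/username/vit-gpt2-captioning"
HEADERS = {"Authorization": "Bearer hf_xxx"}

with open("example.jpg", "rb") as f:
    response = requests.post(API_URL, headers=HEADERS, data=f.read())

# Image-to-text endpoints respond with a list like [{"generated_text": "..."}].
print(response.json()[0]["generated_text"])
```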
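The abstract also mentions keyword extraction from the generated captions but does not specify the method, so the sketch below is only an illustrative stopword-filtered frequency approach, not the author's algorithm.

```python
from collections import Counter

# A small illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "are", "on", "in", "of", "and", "with"}

def extract_keywords(caption: str, top_k: int = 5) -> list[str]:
    """Return the top_k most frequent non-stopword tokens in a caption."""
    words = [w.strip(".,!?").lower() for w in caption.split()]
    counts = Counter(w for w in words if w and w not in STOPWORDS)
    return [word for word, _ in counts.most_common(top_k)]

print(extract_keywords("a brown dog is running on the green grass"))
# ['brown', 'dog', 'running', 'green', 'grass']
```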

