
dc.contributor.advisor: Amalia
dc.contributor.advisor: Hardi, Sri Melvani
dc.contributor.author: Ginting, Elin Betsey
dc.date.accessioned: 2025-07-02T04:07:04Z
dc.date.available: 2025-07-02T04:07:04Z
dc.date.issued: 2025
dc.identifier.uri: https://repositori.usu.ac.id/handle/123456789/104776
dc.description.abstract: The increasing volume and variety of images, often lacking accompanying information or descriptions, pose a new challenge in image processing. This research develops an image captioning system using Transformer models, integrated into a web-based platform. The system is also equipped with keyword extraction from the generated captions and audio conversion of those captions to enhance accessibility. The training dataset was developed independently by collecting various types of images from the internet and manually annotating each with a caption. In total, the dataset consists of 28,009 images covering categories such as photographs, paintings, cartoons, illustrations, carvings, and sculptures. The methodology applies a Vision Transformer (ViT) model as the encoder to extract features from the images and GPT-2 as the decoder to generate text-based captions. The model was trained with a supervised fine-tuning approach using a batch size of 4, 10 epochs, and gradient clipping to maintain training stability. Its performance was evaluated using ROUGE metrics to assess the similarity between generated captions and reference captions. The trained model achieved a training loss of 0.0187 and a validation loss of 0.147831. Evaluation results show that the model obtained a ROUGE-1 score of 60.6806, ROUGE-2 of 46.842, ROUGE-L of 52.5587, and ROUGE-Lsum of 52.0473. The developed model was then uploaded to the Hugging Face Hub and integrated into a website via the Hugging Face API; the website additionally uses the Web Speech API to convert caption text into audio. These findings demonstrate that integrating such models for the image captioning task has great potential to improve the accessibility of visual information through digital platforms. [en_US] (Illustrative Python sketches of these steps follow the record below.)
dc.language.iso: id [en_US]
dc.publisher: Universitas Sumatera Utara [en_US]
dc.subject: Image Captioning [en_US]
dc.subject: Vision Transformer (ViT) [en_US]
dc.subject: GPT-2 [en_US]
dc.subject: Keyword Extraction [en_US]
dc.subject: Text-to-Speech Conversion [en_US]
dc.subject: Hugging Face API [en_US]
dc.subject: Web Speech API [en_US]
dc.title: Generasi Caption dan Ekstraksi Keyword untuk Gambar dengan Fine-Tuning Vision Transformer dan GPT-2 serta Fitur Konversi Audio untuk Peningkatan Aksesibilitas [en_US]
dc.title.alternative: Caption Generation and Keyword Extraction for Images Using Fine-Tuned Vision Transformer and GPT-2 with Audio Conversion Feature for Enhanced Accessibility [en_US]
dc.type: Thesis [en_US]
dc.identifier.nim: NIM211401050
dc.identifier.nidn: NIDN0121127801
dc.identifier.kodeprodi: KODEPRODI55201#Ilmu Komputer
dc.description.pages: 78 Pages [en_US]
dc.description.type: Skripsi Sarjana (Undergraduate Thesis) [en_US]
dc.subject.sdgs: SDGs 4. Quality Education [en_US]
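
The abstract pairs a ViT encoder with a GPT-2 decoder for caption generation. Below is a minimal sketch of how such a pairing is typically loaded and run with the Hugging Face transformers VisionEncoderDecoderModel; the model id, image path, and decoding parameters are illustrative assumptions, not details taken from the thesis.

```python
from PIL import Image
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, GPT2TokenizerFast

MODEL_ID = "username/vit-gpt2-captioning"  # hypothetical Hub id, not from the thesis

model = VisionEncoderDecoderModel.from_pretrained(MODEL_ID)
processor = ViTImageProcessor.from_pretrained(MODEL_ID)
tokenizer = GPT2TokenizerFast.from_pretrained(MODEL_ID)

# The ViT encoder consumes fixed-size image patches; the processor handles
# resizing and normalization.
image = Image.open("example.jpg").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values

# The GPT-2 decoder generates the caption token by token from the image features.
output_ids = model.generate(pixel_values, max_length=64, num_beams=4)  # assumed decoding settings
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```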
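The abstract states the model was fine-tuned with a batch size of 4 for 10 epochs with gradient clipping. A minimal PyTorch loop matching those hyperparameters might look as follows; the learning rate, the clipping norm, and `train_dataset` are assumptions not given in the abstract.

```python
import torch
from torch.utils.data import DataLoader

# `model` is the VisionEncoderDecoderModel from the previous sketch;
# `train_dataset` is assumed to yield dicts with fixed-size "pixel_values"
# tensors and padded "labels" (caption token ids) so the default collate works.
loader = DataLoader(train_dataset, batch_size=4, shuffle=True)  # batch size 4 per the abstract
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)  # assumed learning rate

model.train()
for epoch in range(10):  # 10 epochs per the abstract
    for batch in loader:
        # VisionEncoderDecoderModel returns a cross-entropy loss when `labels`
        # are supplied alongside the pixel values.
        loss = model(pixel_values=batch["pixel_values"], labels=batch["labels"]).loss
        loss.backward()
        # Gradient clipping, as described in the abstract, to keep training
        # stable; the max-norm value is an assumption.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        optimizer.zero_grad()
```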
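Evaluation uses ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-Lsum. The Hugging Face `evaluate` library reports exactly those four variants; a minimal sketch follows, with placeholder captions rather than thesis data.

```python
import evaluate  # pip install evaluate rouge_score

rouge = evaluate.load("rouge")

# Placeholder captions, not thesis data.
predictions = ["a brown dog runs across the grass"]    # generated captions
references = ["a brown dog is running on the grass"]   # reference captions

# Returns rouge1, rouge2, rougeL, and rougeLsum as fractions in [0, 1];
# the abstract reports these scores scaled by 100.
print(rouge.compute(predictions=predictions, references=references))
```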
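The trained checkpoint was uploaded to the Hugging Face Hub and called from the website via the Hugging Face API. A minimal sketch of such a call over HTTP using the serverless Inference API is shown below; the model id and token are placeholders. The final text-to-speech step runs client-side in the browser through the Web Speech API (speechSynthesis), so it has no Python counterpart here.

```python
import requests

# Placeholder model id and token; the real site would call the thesis's own
# fine-tuned checkpoint on the Hub.
API_URL = "https://api-inference.huggingface.co/models/username/vit-gpt2-captioning"
HEADERS = {"Authorization": "Bearer hf_xxx"}

with open("example.jpg", "rb") as f:
    response = requests.post(API_URL, headers=HEADERS, data=f.read())

# Image-to-text endpoints respond with a list like [{"generated_text": "..."}].
print(response.json()[0]["generated_text"])
```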
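The abstract also mentions keyword extraction from the generated captions but does not specify the method, so the sketch below is only an illustrative stopword-filtered frequency approach, not the author's algorithm.

```python
from collections import Counter

# A small illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "are", "on", "in", "of", "and", "with"}

def extract_keywords(caption: str, top_k: int = 5) -> list[str]:
    """Return the top_k most frequent non-stopword tokens in a caption."""
    words = [w.strip(".,!?").lower() for w in caption.split()]
    counts = Counter(w for w in words if w and w not in STOPWORDS)
    return [word for word, _ in counts.most_common(top_k)]

print(extract_keywords("a brown dog is running on the green grass"))
# ['brown', 'dog', 'running', 'green', 'grass']
```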

