
    Generasi Caption dan Ekstraksi Keyword untuk Gambar dengan Fine-Tuning Vision Transformer dan GPT-2 serta Fitur Konversi Audio untuk Peningkatan Aksesibilitas

    Caption Generation and Keyword Extraction for Images Using Fine-Tuned Vision Transformer and GPT-2 With Audio Conversion Feature for Enhanced Accessibility

    View/Open
    Cover (943.6 KB)
    Fulltext (3.662 MB)
    Date
    2025
    Author
    Ginting, Elin Betsey
    Advisor(s)
    Amalia
    Hardi, Sri Melvani
    Abstract
    The increasing volume and variety of images, often lacking accompanying information or descriptions, pose a new challenge in image processing. This research develops an image captioning system using Transformer models, integrated into a web-based platform. The system also provides keyword extraction from the generated captions and audio conversion of the captions to enhance accessibility. The training dataset was developed independently by collecting various types of images from the internet and manually annotating each with a caption; in total, it consists of 28,009 images covering categories such as photographs, paintings, cartoons, illustrations, carvings, and sculptures. The methodology applies a Vision Transformer (ViT) as the encoder to extract features from the images and GPT-2 as the decoder to generate text-based captions. The model was trained with a supervised fine-tuning approach using a batch size of 4, 10 epochs, and gradient clipping to maintain training stability. Its performance was evaluated using ROUGE metrics to assess the similarity between generated captions and reference captions. The trained model achieved a training loss of 0.0187 and a validation loss of 0.147831. Evaluation results show that the model obtained a ROUGE-1 score of 60.6806, ROUGE-2 of 46.842, ROUGE-L of 52.5587, and ROUGE-Lsum of 52.0473. The developed model was then uploaded to the Hugging Face Hub and integrated into a website via the Hugging Face API; the website additionally uses the Web Speech API to convert caption text into audio. These findings demonstrate that integrating such models for the image captioning task has great potential to improve access to visual information through digital platforms.
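
    The following is a minimal sketch of the training setup the abstract describes, assuming the Hugging Face transformers library: ViT as the vision encoder, GPT-2 as the text decoder, batch size 4, 10 epochs, and gradient clipping. The base checkpoint names, the clipping norm of 1.0, and the toy dataset are illustrative assumptions; the thesis's own 28,009-image annotated dataset and its remaining hyperparameters are not reproduced here.

import numpy as np
import torch
from torch.utils.data import Dataset
from transformers import (
    GPT2TokenizerFast,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    ViTImageProcessor,
    VisionEncoderDecoderModel,
)

# ViT encoder + GPT-2 decoder joined as one encoder-decoder model
# (assumed base checkpoints; the thesis does not name its starting weights).
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k", "gpt2"
)
image_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# GPT-2 has no pad token by default; reuse EOS so batches can be padded.
tokenizer.pad_token = tokenizer.eos_token
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.eos_token_id = tokenizer.eos_token_id

class ToyCaptionDataset(Dataset):
    """Stand-in for the thesis's annotated image-caption pairs (hypothetical data)."""
    def __init__(self, n=8):
        self.items = []
        for _ in range(n):
            image = (np.random.rand(224, 224, 3) * 255).astype("uint8")  # random image
            pixel_values = image_processor(images=image, return_tensors="pt").pixel_values[0]
            labels = tokenizer(
                "a placeholder caption", padding="max_length",
                max_length=16, return_tensors="pt",
            ).input_ids[0]
            self.items.append({"pixel_values": pixel_values, "labels": labels})

    def __len__(self):
        return len(self.items)

    def __getitem__(self, i):
        return self.items[i]

training_args = Seq2SeqTrainingArguments(
    output_dir="vit-gpt2-captioning",
    per_device_train_batch_size=4,  # batch size 4, per the abstract
    num_train_epochs=10,            # 10 epochs, per the abstract
    max_grad_norm=1.0,              # gradient clipping for stability (value assumed)
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=ToyCaptionDataset(),
)
trainer.train()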
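
    The ROUGE evaluation step can be sketched with the Hugging Face evaluate library; the caption pair below is a made-up example, not taken from the thesis. Note that evaluate returns F-measures in [0, 1], whereas the abstract reports scores scaled to 0-100.

import evaluate

rouge = evaluate.load("rouge")
predictions = ["a brown dog runs across the grass"]    # generated caption (example)
references = ["a brown dog is running on the grass"]   # human reference caption (example)

# Returns ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-Lsum F-measures.
scores = rouge.compute(predictions=predictions, references=references)
print(scores)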
    URI
    https://repositori.usu.ac.id/handle/123456789/104776
    Collections
    • Undergraduate Theses [1235]
