    Optimasi Caption Otomatis: Studi Refinement Caption dari Model Vision-Language Menggunakan GPT-3.5

    Automatic Caption Optimization: A Caption Refinement Study of Vision-Language Models Using GPT-3.5

    View/Open
    Cover (600.9 KB)
    Fulltext (3.534 MB)
    Date
    2025
    Author
    Saragih, Andhika Mandalanta
    Advisor(s)
    Harumy, Henny Febriana
    Hardi, Sri Melvani
    Abstract
    Generating relevant and engaging captions for images is a common challenge across digital platforms such as social media and e-commerce, making systems that can automatically produce image captions increasingly important. However, vision-language models such as BLIP still face limitations in generating descriptions that are truly accurate and contextually appropriate, especially for ambiguous or complex visual elements. This study aims to optimize the automatic captioning results of the BLIP model through stylistic refinement using GPT-3.5. The initial captions generated by BLIP are adapted into four language styles (Formal, Informal, Social Media, and E-Commerce) using a prompt-based approach. Evaluation is conducted quantitatively with BERTScore and subjectively through user surveys. The quantitative results show that the refined captions have a high degree of semantic similarity to the reference captions, with the highest F1 score for the Formal style (0.888) and the lowest for the Social Media style (0.805). Subjective evaluation by 83 users yielded an average score of 4.19 on a 5-point Likert scale, indicating positive user perceptions of readability, relevance, and ease of use. Nevertheless, the system still has limitations, such as visual misinterpretations by BLIP, for instance mistaking patterns on clothing for other objects. This study demonstrates that the integration of BLIP and GPT-3.5 can produce image captions that are adaptive to various communication styles and holds strong potential for further development in multimodal applications.
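
    The abstract outlines a two-stage pipeline: BLIP produces an initial caption, GPT-3.5 rewrites it into a target language style via a prompt, and BERTScore measures semantic similarity against a reference caption. The sketch below illustrates that flow only; the BLIP checkpoint (Salesforce/blip-image-captioning-base), the gpt-3.5-turbo model name, the style prompt, and the reference caption are illustrative assumptions, since the record does not specify the thesis's exact configuration.

# Sketch of the BLIP -> GPT-3.5 refinement pipeline described in the abstract.
# Model names, prompts, and the reference caption are illustrative assumptions,
# not the thesis's exact setup.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
from openai import OpenAI
from bert_score import score

# Stage 1: generate an initial caption with BLIP.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("example.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
caption_ids = blip.generate(**inputs, max_new_tokens=30)
initial_caption = processor.decode(caption_ids[0], skip_special_tokens=True)

# Stage 2: refine the caption into a target language style with GPT-3.5.
client = OpenAI()  # reads OPENAI_API_KEY from the environment
STYLES = ["Formal", "Informal", "Social Media", "E-Commerce"]

def refine(caption: str, style: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": f"Rewrite image captions in a {style} style."},
            {"role": "user", "content": caption},
        ],
    )
    return response.choices[0].message.content.strip()

refined = {style: refine(initial_caption, style) for style in STYLES}

# Evaluation: semantic similarity to a human reference caption via BERTScore F1.
reference = ["a person wearing a patterned shirt stands in a park"]  # placeholder reference
for style, text in refined.items():
    _, _, f1 = score([text], reference, lang="en")
    print(style, round(f1.item(), 3))

    The per-style F1 values printed here correspond to the kind of comparison reported in the abstract (for example, 0.888 for the Formal style versus 0.805 for Social Media).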
    URI
    https://repositori.usu.ac.id/handle/123456789/106211
    Collections
    • Undergraduate Theses [1235]