Automatic Caption Optimization: A Caption Refinement Study of Vision-Language Models Using GPT-3.5

Date
2025
Author
Saragih, Andhika Mandalanta
Advisor(s)
Harumy, Henny Febriana
Hardi, Sri Melvani
Abstract
Generating relevant and engaging captions for images is a common challenge across digital platforms such as social media and e-commerce, making systems that can automatically produce image captions increasingly important. However, vision-language models such as BLIP still face limitations in generating descriptions that are truly accurate and contextually appropriate, especially for ambiguous or complex visual elements. This study aims to optimize the automatic captions produced by the BLIP model through stylistic refinement using GPT-3.5. The initial captions generated by BLIP are adapted into four language styles (Formal, Informal, Social Media, and E-Commerce) using a prompt-based approach. Evaluation is conducted quantitatively with BERTScore and subjectively through user surveys. The quantitative results show that the refined captions have a high degree of semantic similarity to the reference captions, with the highest F1 score for the Formal style (0.888) and the lowest for the Social Media style (0.805). Subjective evaluation by 83 users yielded an average score of 4.19 on a 5-point Likert scale, indicating positive user perceptions of readability, relevance, and ease of use. Nevertheless, the system still has limitations, such as visual misinterpretations by BLIP (for instance, mistaking patterns on clothing for other objects). This study demonstrates that the integration of BLIP and GPT-3.5 can produce image captions that adapt to various communication styles and holds strong potential for further development in multimodal applications.
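The thesis text itself is not reproduced here, so the following is only a minimal sketch of how the described BLIP-to-GPT-3.5 pipeline might be wired together. It assumes the Hugging Face transformers checkpoint Salesforce/blip-image-captioning-base and the OpenAI gpt-3.5-turbo chat API; the prompt wording and style definitions are illustrative assumptions, not the author's actual prompts.

```python
# Sketch of a BLIP -> GPT-3.5 caption refinement pipeline.
# Model choice, prompt wording, and style definitions are assumptions
# for illustration; they are not taken from the thesis.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
from openai import OpenAI

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
client = OpenAI()  # reads OPENAI_API_KEY from the environment

STYLES = {
    "Formal": "a formal, descriptive sentence",
    "Informal": "a casual, conversational sentence",
    "Social Media": "a short, engaging social-media caption with a friendly tone",
    "E-Commerce": "a persuasive product description for an online store",
}

def generate_base_caption(image_path: str) -> str:
    """Produce the initial caption with BLIP."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    output_ids = blip.generate(**inputs, max_new_tokens=30)
    return processor.decode(output_ids[0], skip_special_tokens=True)

def refine_caption(base_caption: str, style: str) -> str:
    """Rewrite the BLIP caption into the requested style with GPT-3.5."""
    prompt = (
        f"Rewrite the following image caption as {STYLES[style]}. "
        f"Keep the factual content unchanged.\n\nCaption: {base_caption}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return response.choices[0].message.content.strip()

base = generate_base_caption("example.jpg")
for style in STYLES:
    print(style, "->", refine_caption(base, style))
```

One design point worth noting: prompt-based refinement of this kind leaves BLIP frozen and pushes all style adaptation to the language model, which is what allows new styles to be added by editing a prompt rather than retraining the captioner.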
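For the quantitative evaluation, the abstract reports BERTScore F1 against reference captions. A minimal sketch using the bert-score package is shown below; the candidate/reference pairs are placeholders, and the thesis's actual model and scoring configuration are not specified here.

```python
# Sketch of BERTScore evaluation (pip install bert-score).
# The caption pairs below are placeholders, not thesis data.
from bert_score import score

candidates = ["A woman in a red dress stands near a storefront."]  # refined captions
references = ["A lady wearing a red dress in front of a shop."]    # reference captions

P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {F1.mean().item():.3f}")  # averaged per style, e.g. Formal vs. Social Media
```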