
Discrete Diffusion Model for Image Captioning by Self-Critical Learning

Silvio, Vincenzo; Di Nardo, Emanuel; Ciaramella, Angelo
Publication date: 2025-01-01

Abstract

Image captioning, the generation of textual descriptions of images, is a complex task that requires the integration of computer vision and natural language processing. Due to the ever-increasing quantity of image data accessible on the internet, image captioning has emerged as a crucial area of study in both fields. Several methodologies have been proposed for image captioning, including template-based, retrieval-based, and generative techniques. Generative methods, particularly those that employ deep neural networks to learn image features and produce captions, have achieved remarkable results. However, assessing the quality of generated captions remains difficult, as no single metric can accurately evaluate the performance of image captioning models. The purpose of this study is to examine the effectiveness of various image captioning models on a dataset, evaluate the efficacy of diverse assessment metrics, and investigate methods to improve the quality and diversity of generated captions. The goal is to contribute to the progression of image captioning and provide insights into the challenges and prospects in this field. Specifically, this study focuses on a diffusion-based captioning model fine-tuned with a self-critical reinforcement learning technique. Furthermore, the same optimization technique is applied to other image captioning models to demonstrate the efficacy of the fine-tuning strategy and to show how its usefulness depends on the complexity of the underlying architecture.
Year: 2025
ISBN: 9789819609932; 9789819609949
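
As background for the technique named in the title: "self-critical learning" for captioning is usually instantiated as self-critical sequence training (SCST; Rennie et al., CVPR 2017), a REINFORCE-style update in which the model's own greedy decode supplies the reward baseline, so no learned critic is needed and a non-differentiable metric such as CIDEr can be optimized directly. The PyTorch sketch below is a minimal, illustrative version of that loss, not the paper's implementation; the function name, tensor shapes, and the use of CIDEr rewards are assumptions made for the example.

    import torch

    def self_critical_loss(log_probs, sample_reward, greedy_reward, mask):
        # log_probs:     (B, T) log-probabilities of the sampled caption tokens
        # sample_reward: (B,)   sentence-level reward (e.g. CIDEr) of sampled captions
        # greedy_reward: (B,)   reward of the greedy decodes of the same images
        # mask:          (B, T) 1 for real tokens, 0 for padding
        advantage = (sample_reward - greedy_reward).unsqueeze(1)   # (B, 1)
        # REINFORCE with the greedy decode as baseline:
        # maximizing expected reward = minimizing -advantage * log p.
        per_token = -advantage * log_probs * mask                  # (B, T)
        return per_token.sum() / mask.sum().clamp(min=1)

    # Toy usage with illustrative reward values:
    log_probs = torch.log(torch.rand(2, 5))
    mask = torch.ones(2, 5)
    loss = self_critical_loss(log_probs,
                              torch.tensor([0.9, 0.4]),   # CIDEr of sampled captions
                              torch.tensor([0.7, 0.5]),   # CIDEr of greedy baselines
                              mask)

Because the baseline is the model's own inference-time output, sampled captions that beat the greedy decode are reinforced and weaker ones are suppressed, which is what allows the same fine-tuning step to be applied to autoregressive and diffusion-based decoders alike, as the abstract describes.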
Files in this record:
There are no files associated with this record.

Documents in IRIS are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11367/152658