Visual Question Answering and XAI: Multimodal Approach for Automatic Diagnosis from Lung Radiographs

Antonio Agliata; Mariano Caiazzo; Angelo Ciaramella; Emanuel Di Nardo; Antonio Pilato
2025-01-01

Abstract

Respiratory diseases are among the leading causes of morbidity worldwide, making timely and accurate diagnosis essential. However, interpreting chest X-rays is challenging because pathological manifestations vary widely and human analysis is subjective. In this study, we propose a multimodal approach that integrates automated image analysis with textual clinical data, leveraging a Visual Question Answering (VQA)-based architecture and a text generation model for diagnostic report production. Grad-CAM enhances the interpretability of the system by highlighting the image regions most relevant to the diagnosis. The model was trained on a balanced dataset obtained by merging three sources (Lung X-ray Data, NIH Chest X-rays, and Chest X-Ray Images), ensuring fair classification across the Normal and Pneumonia categories. The pipeline includes visual feature extraction using a Vision Transformer (ViT), automatic pathology classification, and diagnostic report generation with an advanced language model. Results indicate a significant improvement in diagnostic accuracy compared to traditional methods, supported by key performance metrics: an accuracy of 95.3%, together with sensitivity, specificity, and F1-score. Furthermore, integrating the system into an interactive web app facilitates clinical adoption, enhancing diagnostic efficiency and supporting personalized management of pulmonary diseases.
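
The abstract reports accuracy, sensitivity, specificity, and F1-score for the Normal versus Pneumonia classification. As a reading aid, the following minimal sketch shows how these four metrics are commonly derived from a binary confusion matrix. It is not the authors' evaluation code: the function name `binary_metrics`, the `BinaryMetrics` container, and the choice of "Pneumonia" as the positive class are assumptions made for illustration, and the labels in the usage lines are toy values, not data from the study.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class BinaryMetrics:
    accuracy: float
    sensitivity: float  # recall on the positive class
    specificity: float  # recall on the negative class
    f1: float


def _ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(
    y_true: Sequence[str],
    y_pred: Sequence[str],
    positive: str = "Pneumonia",  # assumed positive class, not stated in the source
) -> BinaryMetrics:
    """Compute standard binary metrics from true and predicted labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1

    precision = _ratio(tp, tp + fp)
    sensitivity = _ratio(tp, tp + fn)
    specificity = _ratio(tn, tn + fp)
    return BinaryMetrics(
        accuracy=_ratio(tp + tn, tp + fp + tn + fn),
        sensitivity=sensitivity,
        specificity=specificity,
        f1=_ratio(2 * precision * sensitivity, precision + sensitivity),
    )


if __name__ == "__main__":
    # Toy labels for illustration only.
    truth = ["Pneumonia", "Pneumonia", "Normal", "Normal"]
    predicted = ["Pneumonia", "Normal", "Normal", "Normal"]
    print(binary_metrics(truth, predicted))
```

On a balanced dataset such as the one described, accuracy is a reasonable summary, but sensitivity and specificity remain informative because they separate missed pneumonia cases from false alarms.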

Use this identifier to cite or link to this document: https://hdl.handle.net/11367/159963