Visual Question Answering and XAI: Multimodal Approach for Automatic Diagnosis from Lung Radiographs

Agliata, Antonio; Bilò, Vittorio; Caiazzo, Mariano; Caruso, Antonio Mario; Ciaramella, Angelo; Di Nardo, Emanuel; Pilato, Antonio; Sorrentino, Mariacarmen; Vinci, Cosimo

doi:10.1109/iscc65549.2025.11326236

Respiratory diseases are among the leading causes of morbidity worldwide, making timely and accurate diagnosis essential. However, interpreting chest X-rays is challenging because pathological manifestations vary widely and human analysis is subjective. In this study, we propose a multimodal approach that integrates automated image analysis with textual clinical data, combining a Visual Question Answering (VQA)-based architecture with a text generation model that produces diagnostic reports. Grad-CAM improves the interpretability of the system by highlighting the image regions most relevant to the diagnosis. The model was trained on a balanced dataset obtained by merging three sources (Lung X-ray Data, NIH Chest X-rays, and Chest X-Ray Images), ensuring fair classification across the Normal and Pneumonia categories. The pipeline comprises visual feature extraction with a Vision Transformer (ViT), automatic pathology classification, and diagnostic report generation with an advanced language model. Results indicate a significant improvement in diagnostic accuracy over traditional methods, with an accuracy of 95.3% reported alongside sensitivity, specificity, and F1-score. Furthermore, integrating the system into an interactive web app facilitates clinical adoption, enhancing diagnostic efficiency and supporting personalized management of pulmonary diseases.
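The abstract reports accuracy, sensitivity, specificity, and F1-score for the binary Normal/Pneumonia task. As a reading aid, the sketch below shows how these four metrics are conventionally derived from confusion-matrix counts. It is not the authors' evaluation code: the function name `binary_metrics`, the `BinaryMetrics` container, the choice of Pneumonia as the positive class, and the toy labels are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class BinaryMetrics:
    accuracy: float
    sensitivity: float  # recall on the positive class
    specificity: float  # recall on the negative class
    f1: float


def binary_metrics(y_true, y_pred, positive="Pneumonia"):
    """Compute standard binary metrics from paired label sequences.

    Hypothetical helper: `positive` names the class treated as the
    positive (disease) class; every other label counts as negative.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1

    def ratio(num, den):
        # Return 0.0 instead of dividing by zero when a class is absent.
        return num / den if den else 0.0

    return BinaryMetrics(
        accuracy=ratio(tp + tn, tp + fp + tn + fn),
        sensitivity=ratio(tp, tp + fn),
        specificity=ratio(tn, tn + fp),
        f1=ratio(2 * tp, 2 * tp + fp + fn),
    )


if __name__ == "__main__":
    # Toy labels for illustration only; these are not data from the paper.
    truth = ["Pneumonia", "Pneumonia", "Pneumonia", "Normal",
             "Normal", "Normal", "Normal", "Pneumonia"]
    preds = ["Pneumonia", "Pneumonia", "Normal", "Normal",
             "Normal", "Pneumonia", "Normal", "Pneumonia"]
    print(binary_metrics(truth, preds))
```

On a class-balanced test set, as described for this dataset, accuracy equals the mean of sensitivity and specificity, which is one reason balancing makes a single accuracy figure easier to interpret.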


2025-01-01
