Visual Question Answering and XAI: Multimodal Approach for Automatic Diagnosis from Lung Radiographs
Antonio Agliata; Mariano Caiazzo; Angelo Ciaramella; Emanuel Di Nardo; Antonio Pilato
2025-01-01
Abstract
Respiratory diseases are among the leading causes of morbidity worldwide, making timely and accurate diagnosis essential. However, interpreting chest X-rays is challenging due to the variability of pathological manifestations and the subjectivity of human analysis. In this study, we propose a multimodal approach that integrates automated image analysis with textual clinical data, leveraging a Visual Question Answering (VQA)-based architecture and a text generation model for diagnostic report production. The use of Grad-CAM enhances the interpretability of the system by highlighting the image regions most relevant to the diagnosis. The model was trained on a balanced dataset obtained by merging three sources (Lung X-ray Data, NIH Chest X-rays, and Chest X-Ray Images), ensuring fair classification across the Normal and Pneumonia categories. The pipeline includes visual feature extraction using a Vision Transformer (ViT), automatic pathology classification, and diagnostic report generation with an advanced language model. Results indicate a significant improvement in diagnostic accuracy compared to traditional methods, supported by key performance metrics, including an accuracy of 95.3% as well as sensitivity, specificity, and F1-score. Furthermore, integrating the system into an interactive web app facilitates clinical adoption, enhancing diagnostic efficiency and supporting personalized management of pulmonary diseases.
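The evaluation metrics named in the abstract (accuracy, sensitivity, specificity, and F1-score) can be illustrated for the two-class Normal/Pneumonia setting with a minimal sketch. This is not the authors' evaluation code: the function name `binary_metrics`, the label encoding (1 for Pneumonia, 0 for Normal), and the toy labels are all assumptions made for illustration.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, sensitivity, specificity, and F1-score.

    Assumed encoding (hypothetical): 1 = Pneumonia (positive), 0 = Normal (negative).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Count the four cells of the binary confusion matrix.
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num: float, den: float) -> float:
        # Return 0.0 instead of raising when a denominator is empty.
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)  # recall on the Pneumonia class
    specificity = ratio(tn, tn + fp)  # recall on the Normal class
    precision = ratio(tp, tp + fp)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": ratio(2 * precision * sensitivity, precision + sensitivity),
    }


# Toy labels for illustration only; these are not results from the study.
example = binary_metrics(
    y_true=[1, 1, 1, 1, 0, 0, 0, 0],
    y_pred=[1, 1, 1, 1, 0, 0, 0, 1],
)
```

Reporting sensitivity and specificity alongside accuracy shows how errors are split between missed Pneumonia cases and false alarms on Normal images, which a single accuracy figure does not reveal.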


