ConVLM: Concept-Guided Vision-Language Models for Explainable Dermatological Diagnosis
DOI:
https://doi.org/10.31224/5064
Abstract
Accurate and interpretable diagnosis of dermatological lesions is crucial but challenging due to data scarcity, morphological diversity, and the "black-box" nature of traditional deep learning models. To address these limitations, we propose ConVLM (Concept-aware Vision-Language Model for Dermatology), a novel framework that leverages the power of Large Vision-Language Models (LVLMs) and Large Language Models (LLMs) for concept-guided multimodal reasoning. ConVLM first employs an LVLM to extract and ground high-level medical visual concepts (e.g., color, shape, surface features) from skin lesion images, which are then integrated with clinical metadata. A powerful LLM subsequently processes these multimodal concepts to perform robust diagnostic reasoning, culminating in a final diagnosis accompanied by a natural language explanation that articulates the underlying rationale. Experiments on the challenging SkinCon dataset demonstrate that ConVLM not only achieves competitive or superior diagnostic performance (87.21% BACC, 81.05% F1) but also significantly enhances model interpretability, as validated by human evaluation with dermatologists (4.6/5 clarity, 4.3/5 utility). Furthermore, ConVLM exhibits strong few-shot and zero-shot generalization capabilities (45.1% BACC in 0-shot), crucial for rare conditions. Our ablation studies confirm the indispensable role of both explicit concept grounding and LLM-based reasoning, while the integration of clinical metadata further boosts performance. ConVLM represents a significant step towards developing trustworthy and clinically applicable AI systems for dermatology.
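The two-stage flow described in the abstract (concept extraction, then language-model reasoning over concepts plus metadata) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Diagnosis`, `build_prompt`, `diagnose`, `fake_lvlm`, and `fake_llm`, the prompt wording, and the `label | explanation` reply format are all hypothetical, and the two model calls are replaced by stub functions.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Diagnosis:
    label: str
    explanation: str


def build_prompt(concepts: Dict[str, str], metadata: Dict[str, str]) -> str:
    """Serialize visual concepts and clinical metadata into one text prompt."""
    concept_lines = [f"- {k}: {v}" for k, v in sorted(concepts.items())]
    metadata_lines = [f"- {k}: {v}" for k, v in sorted(metadata.items())]
    return "\n".join(
        ["Visual concepts:"]
        + concept_lines
        + ["Clinical metadata:"]
        + metadata_lines
        # Hypothetical reply format, chosen only to keep parsing simple.
        + ["Answer as 'label | explanation'."]
    )


def diagnose(
    image: bytes,
    metadata: Dict[str, str],
    extract_concepts: Callable[[bytes], Dict[str, str]],
    reason: Callable[[str], str],
) -> Diagnosis:
    """Stage 1: image -> concepts. Stage 2: concepts + metadata -> diagnosis."""
    concepts = extract_concepts(image)
    reply = reason(build_prompt(concepts, metadata))
    label, _, explanation = reply.partition("|")
    return Diagnosis(label.strip(), explanation.strip())


# Stubs standing in for the LVLM and LLM; real models would be called here.
def fake_lvlm(image: bytes) -> Dict[str, str]:
    return {"color": "brown", "shape": "irregular"}


def fake_llm(prompt: str) -> str:
    return "example-label | placeholder rationale based on the listed concepts"


if __name__ == "__main__":
    print(diagnose(b"", {"age": "54"}, fake_lvlm, fake_llm))
```

Passing the two models in as plain callables keeps the sketch independent of any particular model API; the explanation returned alongside the label corresponds to the natural language rationale the abstract mentions.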
License
Copyright (c) 2025 Alexander Davis

This work is licensed under a Creative Commons Attribution 4.0 International License.