Bridging Visual and Linguistic Intelligence for Chest X-rays: A Comprehensive Review of ViTs and LLM Synergies
DOI:
https://doi.org/10.31224/5661
Abstract
The integration of Vision Transformers (ViTs) and Large Language Models (LLMs) in chest X-ray analysis has emerged as a promising response to growing challenges in radiology, including increasing diagnostic workloads and the need for timely, accurate interpretations. This systematic review examines recent advancements in ViT–LLM hybrid systems, exploring their architectural innovations, multimodal fusion strategies, and applications in automated report generation. A comprehensive search of databases such as Google Scholar, PubMed, and IEEE Xplore was conducted to identify studies published between 2018 and 2025, focusing on ViT–LLM integration, performance metrics, and clinical validation. Key findings highlight that ViT–LLM models significantly improve diagnostic accuracy, with a 15% improvement in pneumonia detection compared to traditional CNN-based models. These systems also excel at producing clinically relevant reports, achieving a 93% alignment rate with clinician-generated reports. The reviewed research indicates that ViT–LLM hybrid models reduce diagnostic errors, enhance radiology workflow efficiency, and support clinical decision-making by offering real-time assistance. However, challenges related to computational complexity, data biases, and regulatory approval remain, posing barriers to widespread clinical adoption. Future directions include optimizing these models for real-time deployment, addressing ethical concerns, and integrating them into clinical settings with minimal disruption to existing workflows. The review highlights the potential of ViT–LLM systems to enhance both diagnostic performance and patient care, offering a transformative tool for the future of radiology.
License
Copyright (c) 2025 Mridul Banik

This work is licensed under a Creative Commons Attribution 4.0 International License.