Bridging Visual and Linguistic Intelligence for Chest X-rays: A Comprehensive Review of ViTs and LLM Synergies

Mridul Banik

doi:10.31224/5661

Authors

Mridul Banik COLORADO STATE UNIVERSITY

DOI:

https://doi.org/10.31224/5661

Abstract

The integration of Vision Transformers (ViTs) and Large Language Models (LLMs) in chest X-ray analysis has emerged as a promising response to growing challenges in radiology, including rising diagnostic workloads and the need for timely, accurate interpretations. This systematic review examines recent advances in ViT–LLM hybrid systems, covering their architectural innovations, multimodal fusion strategies, and application to automated report generation. A comprehensive search of databases including Google Scholar, PubMed, and IEEE Xplore was conducted to identify studies published between 2018 and 2025, focusing on ViT–LLM integration, performance metrics, and clinical validation. Key findings indicate that ViT–LLM models significantly improve diagnostic accuracy, with a 15% improvement in pneumonia detection compared to traditional CNN-based models. These systems also excel at producing clinically relevant reports, achieving a 93% alignment rate with clinician-generated reports. The reviewed research further suggests that ViT–LLM hybrid models reduce diagnostic errors, improve radiology workflow efficiency, and support clinical decision-making by offering real-time assistance. However, challenges related to computational complexity, data biases, and regulatory approval remain barriers to widespread clinical adoption. Future directions include optimizing these models for real-time deployment, addressing ethical concerns, and integrating them into clinical settings with minimal disruption to existing workflows. The review highlights the potential of ViT–LLM systems to enhance both diagnostic performance and patient care, offering a transformative tool for the future of radiology.
