Comparative Study of Deep Learning Architectures for Automated Diabetic Retinopathy Grading: Vision Transformer, Swin Transformer, and InceptionResNetV2
DOI:
https://doi.org/10.31224/7150
Abstract
Diabetic Retinopathy (DR) is a vision-threatening complication of diabetes mellitus that progresses silently through five clinically defined severity grades. Timely automated screening is critical to prevent irreversible vision loss, particularly in resource-constrained healthcare settings. This paper presents a systematic comparative study of three state-of-the-art deep learning architectures for five-class DR grading on the APTOS 2019 fundus image dataset (3,662 images): Vision Transformer (ViT-Base/16), Swin Transformer (swin_base_patch4_window7_224), and InceptionResNetV2. All models employ transfer learning from ImageNet-pretrained weights. We analyze each architecture in terms of classification accuracy, per-class F1-score, macro-averaged AUC, GradCAM-based explainability, training dynamics, and parameter efficiency. Our ViT-Base/16 model, fine-tuned end-to-end with AdamW, cosine annealing, and label smoothing, achieves the highest validation accuracy of 85.40% with a macro-averaged F1-score of 0.7247. Swin Transformer achieves 83.20% accuracy, while InceptionResNetV2 achieves 81.40% through two-stage transfer learning. GradCAM visualizations confirm clinically aligned lesion localization across all architectures. This work provides architectural insights for deploying robust DR screening systems in clinical environments.
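The abstract reports both validation accuracy and a macro-averaged F1-score, and the two can differ noticeably when the severity grades are imbalanced. The following is a minimal sketch of how a macro-averaged F1 is typically computed, assuming the standard definition (the unweighted mean of per-class F1 scores). The function names and the toy labels are hypothetical and chosen only for illustration; they are not taken from the paper's code or data.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted grade equals the true grade."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred, num_classes=5):
    """Unweighted mean of per-class F1 scores (standard macro averaging)."""
    f1_scores = []
    for c in range(num_classes):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        denom = 2 * tp + fp + fn
        # A class with no true and no predicted samples contributes 0 here.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / num_classes


# Toy labels (invented for illustration): grades 0..4, with grade 0 dominant.
y_true = [0, 0, 0, 0, 0, 0, 1, 2, 2, 3, 4, 4]
y_pred = [0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 4, 4]

acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred)
print(f"accuracy={acc:.4f}, macro F1={mf1:.4f}")
```

In this toy case the two minority grades that are never predicted correctly each contribute an F1 of zero, so the macro average falls well below the accuracy. The same mechanism is one plausible reason a macro-averaged F1 can sit below overall accuracy on a five-grade DR dataset.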
License
Copyright (c) 2026 Prajwal Khandait

This work is licensed under a Creative Commons Attribution 4.0 International License.