Comparative Study of Deep Learning Architectures for Automated Diabetic Retinopathy Grading: Vision Transformer, Swin Transformer, and InceptionResNetV2

Prajwal Khandait

doi:10.31224/7150

Prajwal Khandait, Shivaji University, Kolhapur

DOI: https://doi.org/10.31224/7150

Abstract

Diabetic Retinopathy (DR) is a vision-threatening complication of diabetes mellitus that progresses silently through five clinically defined severity grades. Timely automated screening is critical to prevent irreversible vision loss, particularly in resource-constrained healthcare settings. This paper presents a systematic comparative study of three state-of-the-art deep learning architectures applied to five-class DR grading on the APTOS 2019 fundus image dataset (3,662 images): Vision Transformer (ViT-Base/16), Swin Transformer (swin_base_patch4_window7_224), and InceptionResNetV2. All models employ transfer learning from ImageNet-pretrained weights. We analyze each architecture in terms of classification accuracy, per-class F1-score, macro-averaged AUC, GradCAM-based explainability, training dynamics, and parameter efficiency. Our ViT-Base/16 model, fine-tuned end-to-end with AdamW, cosine annealing, and label smoothing, achieves the highest validation accuracy, 85.40%, with a macro-averaged F1-score of 0.7247. Swin Transformer achieves 83.20% accuracy, while InceptionResNetV2 achieves 81.40% through two-stage transfer learning. GradCAM visualizations confirm clinically aligned lesion localization across all architectures. This work provides architectural insights for deploying robust DR screening systems in clinical environments.
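The abstract reports both validation accuracy and a macro-averaged F1-score, and the two can differ noticeably when severity grades are unevenly represented. The following is a minimal sketch of how macro-averaged F1 is commonly computed, not the evaluation code used in the study. The function names `macro_f1` and `accuracy` and the toy labels are assumptions made for illustration; they are not APTOS 2019 data or results.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted grade equals the true grade."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred, num_classes=5):
    """Unweighted mean of per-class F1 scores (hypothetical helper).

    Every class contributes equally, regardless of how many samples it has,
    so poorly predicted rare classes pull the score down.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    scores = []
    for c in range(num_classes):
        denom = 2 * tp[c] + fp[c] + fn[c]
        # F1 = 2TP / (2TP + FP + FN); defined here as 0 when the class never appears
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / num_classes


# Toy, made-up labels: one dominant class and two rare ones.
toy_true = [0] * 8 + [1] + [2]
toy_pred = [0] * 10  # a classifier that always predicts the majority class

toy_acc = accuracy(toy_true, toy_pred)
toy_f1 = macro_f1(toy_true, toy_pred, num_classes=3)
```

In this toy case the always-majority classifier scores high on accuracy but low on macro-averaged F1, which is why reporting both metrics is informative for imbalanced grading tasks.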
