Retinal Diseases Classification From OCT Images Using Pretrained Dual-Encoder Architecture
Corresponding author email: vinhlv@hcmute.edu.vn
DOI: https://doi.org/10.54644/jte.2026.2062
Keywords: Multimodal, Classification, Retinal diseases, OCT images, Deep learning
Abstract
Retinal diseases, such as age-related macular degeneration (AMD), diabetic retinopathy (DR), and glaucoma, are leading causes of irreversible vision loss, making early and accurate diagnosis essential for effective treatment. Optical coherence tomography (OCT) provides high-resolution cross-sectional retinal images that support disease assessment; however, challenges remain due to image noise and artifacts and to complex retinal structures. In this study, we propose a dual-encoder framework for retinal disease classification from OCT B-scan images that jointly leverages two pretrained foundation models: RETFound and MIRAGE. Following standardized preprocessing and resampling, high-quality features extracted from the two encoders are combined for the final classification task. To mitigate overfitting on limited medical data, the RETFound encoder is frozen during training to preserve general visual features, whereas the MIRAGE encoder is fine-tuned to adapt to the specific classification objective. Extensive experiments on seven public OCT benchmark datasets demonstrate that the proposed method outperforms single-encoder baselines on the majority of benchmarks. The framework achieved an average balanced accuracy (BAcc) of 89.8%, an F1-score of 90.7%, and a Matthews correlation coefficient (MCC) of 83.9%. These results confirm the effectiveness of combining complementary pretrained encoders for robust and generalizable retinal disease classification in clinical settings.
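The freeze-one, fine-tune-one fusion strategy described in the abstract can be sketched in a few lines of PyTorch. This is a minimal illustration, not the authors' implementation: the tiny convolutional encoders below are hypothetical stand-ins for the pretrained RETFound and MIRAGE backbones, and simple feature concatenation is assumed as the fusion mechanism.

```python
import torch
import torch.nn as nn

class DualEncoderClassifier(nn.Module):
    """Dual-encoder sketch: one encoder is frozen to preserve general
    visual features, the other is fine-tuned; their embeddings are
    concatenated and passed to a linear classification head."""

    def __init__(self, frozen_enc, tuned_enc, feat_dim, num_classes):
        super().__init__()
        self.frozen_enc = frozen_enc
        # Freeze to mitigate overfitting on limited medical data.
        for p in self.frozen_enc.parameters():
            p.requires_grad = False
        self.tuned_enc = tuned_enc  # trained end to end
        self.head = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, x):
        with torch.no_grad():
            f1 = self.frozen_enc(x)  # general features, no gradients
        f2 = self.tuned_enc(x)       # task-adapted features
        return self.head(torch.cat([f1, f2], dim=1))

def tiny_encoder(feat_dim=32):
    # Hypothetical stand-in for a pretrained foundation-model encoder.
    return nn.Sequential(
        nn.Conv2d(1, 8, 3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(8, feat_dim),
    )

model = DualEncoderClassifier(tiny_encoder(), tiny_encoder(),
                              feat_dim=32, num_classes=4)
# Batch of 2 grayscale OCT B-scans (1 channel, 64x64 pixels).
logits = model(torch.randn(2, 1, 64, 64))
```

In training, only `tuned_enc` and `head` receive gradient updates, so an optimizer would be built from `filter(lambda p: p.requires_grad, model.parameters())`.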
Copyright (c) 2026 Journal of Technical Education Science.
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Copyright belongs to JTE.


