Journal of Southern Medical University ›› 2026, Vol. 46 ›› Issue (4): 946-955. doi: 10.12122/j.issn.1673-4254.2026.04.23
Received: 2025-09-28
Online: 2026-04-20
Published: 2026-04-24
Contact: Yu ZHANG, E-mail: wmei2933@gmail.com; yuzhang@smu.edu.cn
About the first author: Jinger WANG, Master's degree candidate, E-mail: wmei2933@gmail.com
Abstract:
Objective To develop a masked image modeling (MIM) framework that incorporates clinical visual priors to improve semantic understanding and diagnostic performance on chest X-rays. Methods We propose the visual prior-guided masked image modeling framework (VP-MIM). It uses radiologists' eye-gaze data to separate diagnosis-relevant from irrelevant regions at the masking stage and applies an attention-guided masking strategy accordingly; a pyramid attentive reconstruction module then introduces multi-scale supervision at the reconstruction stage and incorporates semantics-aware gaze heatmaps to refine feature learning. Results On the two public datasets RSNA Pneumonia and ChestXray-14, VP-MIM showed a clear advantage in pre-training quality under the linear probing setting. With only 2616 pre-training samples, it achieved an AUC of 86.83 on RSNA Pneumonia single-label classification and a mean AUC (mAUC) of 72.82 on ChestXray-14 multi-label classification. Under the full fine-tuning setting, as the pre-training data were scaled up, VP-MIM reached an mAUC of 85.49 on ChestXray-14, confirming its scalability and strong performance on practical diagnostic tasks. Conclusion VP-MIM effectively alleviates the semantic loss and limited multi-scale modeling of MIM methods in medical imaging, improves chest X-ray diagnostic performance, and maintains stable gains as the data scale increases.
Jinger WANG, Yu ZHANG. Visual prior-guided masked image modeling enhances chest X-ray diagnostic efficacy[J]. Journal of Southern Medical University, 2026, 46(4): 946-955.
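To make the attention-guided masking concrete, the sketch below gives one plausible reading of the strategy described in the abstract; it is not the authors' released code. It assumes a per-patch gaze-density map has been extracted from radiologist eye-tracking data and binarized with a threshold (e.g., Otsu's method, cf. ref [20]), and that a fraction rc of the global mask budget r is spent on attended patches, so a small rc preserves clinically significant regions. All names and the budget rule are illustrative assumptions.

```python
import torch

def attention_controlled_mask(gaze_density: torch.Tensor,
                              threshold: float,
                              r: float = 0.75,
                              rc: float = 0.5) -> torch.Tensor:
    """One plausible reading of attention-controlled masking (illustrative).

    gaze_density: (N,) per-patch gaze density for one image, N = number
        of ViT patches; high values mark clinically attended regions.
    threshold:    binarization threshold (e.g. Otsu, cf. ref [20])
                  separating attended patches from the rest.
    r:  global masking ratio (fraction of all N patches masked).
    rc: clinical masking ratio, read here as the share of the mask
        budget placed on attended patches; small rc preserves them.
    Returns a boolean mask of shape (N,), True = masked.
    """
    n = gaze_density.numel()
    n_mask = int(round(r * n))
    clin = torch.nonzero(gaze_density >= threshold).flatten()
    rest = torch.nonzero(gaze_density < threshold).flatten()
    # Split the budget; clip if either group is too small.
    n_clin = min(int(round(rc * n_mask)), clin.numel())
    n_rest = min(n_mask - n_clin, rest.numel())
    # Sample uniformly within each group.
    pick_clin = clin[torch.randperm(clin.numel())[:n_clin]]
    pick_rest = rest[torch.randperm(rest.numel())[:n_rest]]
    mask = torch.zeros(n, dtype=torch.bool)
    mask[pick_clin] = True
    mask[pick_rest] = True
    return mask
```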
Fig.1 Thoracic abnormalities of varying sizes and the alignment of gaze data with these regions. A: Prominent heart; low lung volumes; left perihilar opacity; no large effusion. This could represent very early congestion or an early left suprahilar infiltrate. B: Mean gaze density for the two kinds of areas.
Fig.2 Overview of the proposed VP-MIM (upper-left), using ViT-B/16 as the default vision backbone. VP-MIM integrates (1) attention-controlled masking, which preserves clinically significant features based on radiologist attention, and (2) pyramid attentive reconstruction, which reconstructs multi-scale HOG features from the embeddings of selected transformer layers. The design of each attentive reconstruction module is shown in the upper-right panel.
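The multi-scale HOG targets behind the pyramid attentive reconstruction can be approximated as below. This is a hedged sketch under the assumption that HOG features (ref [14]) are extracted at several cell sizes, one per pyramid level, using scikit-image; the exact scales and normalization used by VP-MIM are not given on this page.

```python
import numpy as np
from skimage.feature import hog

def multiscale_hog_targets(image: np.ndarray,
                           cell_sizes=(8, 16, 32),
                           orientations=9):
    """Compute one HOG target per pyramid level (illustrative only).

    image: 2-D grayscale chest X-ray, e.g. shape (224, 224).
    cell_sizes: candidate cell sizes; coarser cells capture larger
        structures, mimicking multi-scale supervision.
    Returns a dict mapping cell size -> flattened HOG feature vector.
    """
    targets = {}
    for c in cell_sizes:
        targets[c] = hog(
            image,
            orientations=orientations,
            pixels_per_cell=(c, c),
            cells_per_block=(1, 1),  # per-cell histograms, no blocking
            feature_vector=True,
        )
    return targets
```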
| Methods | RSNA 1% | RSNA 10% | RSNA 100% | CXR-14 1% | CXR-14 10% | CXR-14 100% |
|---|---|---|---|---|---|---|
| MAE[3] | 79.19 (0.06) | 83.04 (0.01) | 84.64 (0.01) | 56.91 (0.59) | 63.73 (0.56) | 68.33 (0.34) |
| MaskFeat[26] | 80.99 (0.23) | 84.29 (0.11) | 85.67 (0.19) | 61.18 (0.62) | 66.19 (1.16) | 70.21 (0.15) |
| dMAE[27] | 81.74 (0.40) | 84.65 (0.10) | 85.70 (0.03) | 58.71 (1.53) | 65.41 (1.73) | 69.88 (0.42) |
| MaskAlign[24] | 79.64 (0.82) | 84.56 (0.12) | 85.26 (0.01) | 58.37 (0.93) | 65.10 (3.96) | 70.52 (1.47) |
| HPM[28] | 82.01 (0.03) | 84.30 (0.02) | 85.46 (0.02) | 59.93 (0.18) | 65.57 (2.48) | 70.87 (0.71) |
| AttnMA | 79.15 (0.98) | 84.06 (0.15) | 85.09 (0.02) | 57.55 (0.34) | 64.81 (3.42) | 70.01 (1.42) |
| DeepMIM[25] | 81.89 (0.36) | 83.96 (0.12) | 85.18 (0.11) | 61.97 (0.35) | 66.41 (1.52) | 71.20 (0.64) |
| VP-MIM | 83.51 (0.08) | 85.49 (0.05) | 86.83 (0.01) | 63.48 (0.22) | 68.53 (0.68) | 72.82 (0.02) |
Tab.1 Quantitative comparison of all MIM methods under the linear probing setting
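For readers unfamiliar with the protocol, linear probing freezes the pre-trained encoder and trains only a linear classifier on top of its features, so the scores above reflect representation quality rather than fine-tuning capacity. A minimal PyTorch sketch, with encoder, feat_dim, and num_classes as illustrative placeholders:

```python
import torch
import torch.nn as nn

def linear_probe(encoder: nn.Module, feat_dim: int, num_classes: int):
    """Freeze the encoder and attach a trainable linear head."""
    for p in encoder.parameters():
        p.requires_grad = False          # encoder stays fixed
    encoder.eval()
    head = nn.Linear(feat_dim, num_classes)
    optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
    return head, optimizer

# Usage: logits = head(encoder(images)); train with BCEWithLogitsLoss
# for multi-label CXR-14 or a single-logit head for RSNA Pneumonia.
```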
| Attention-controlled masking | HOG | Pyramid | Attentive | mAUC (×100) |
|---|---|---|---|---|
| √ | | | | 68.97 (0.16) |
| √ | √ | | | 70.52 (0.44) |
| √ | √ | √ | | 72.18 (0.17) |
| √ | √ | √ | √ | 72.82 (0.02) |
Tab.2 Ablation study on key components (HOG, Pyramid, and Attentive together constitute the pyramid attentive reconstruction module)
Fig.6 Ablation study on the clinical masking ratio rc. rc is varied while the global masking ratio r=0.75 remains fixed throughout pre-training.
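As a concrete illustration of how the two ratios interact, the arithmetic below follows the same reading as the masking sketch above; the page does not give the exact budget formula, so this is an assumption.

```python
# Worked example of the mask budget split (illustrative reading only).
N, r, rc = 196, 0.75, 0.5  # ViT-B/16 on a 224x224 input: 14x14 = 196 patches
n_mask = int(r * N)        # 147 patches masked overall
n_clin = int(rc * n_mask)  # 73 masks placed on gaze-attended patches
n_rest = n_mask - n_clin   # 74 masks placed on the remaining patches
```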
| [1] | Bansal T, Beese R. Interpreting a chest X-ray[J]. Br J Hosp Med, 2019, 80(5): C75-9. doi:10.12968/hmed.2019.80.5.c75 |
| [2] | Bao H, Dong L, Piao S, et al. BEiT: BERT pre-training of image transformers[C/OL]//The Tenth International Conference on Learning Representations (ICLR). 2022. Available from: |
| [3] | He KM, Chen XL, Xie SN, et al. Masked autoencoders are scalable vision learners[C]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). June 18-24, 2022, New Orleans, LA, USA. IEEE, 2022: 16000-09. doi:10.1109/cvpr52688.2022.01553 |
| [4] | Huang SC, Shen LY, Lungren MP, et al. GLoRIA: a multimodal global-local representation learning framework for label-efficient medical image recognition[C]//2021 IEEE/CVF International Conference on Computer Vision (ICCV). October 10-17, 2021, Montreal, QC, Canada. IEEE, 2022: 3922-31. doi:10.1109/iccv48922.2021.00391 |
| [5] | Li JX, Wang YQ, Wang S, et al. Multiscale attention guided network for COVID-19 diagnosis using chest X-ray images[J]. IEEE J Biomed Health Inform, 2021, 25(5): 1336-46. doi:10.1109/jbhi.2021.3058293 |
| [6] | Wang S, Ouyang X, Liu TM, et al. Follow my eye: using gaze to supervise computer-aided diagnosis[J]. IEEE Trans Med Imag, 2022, 41(7): 1688-98. doi:10.1109/tmi.2022.3146973 |
| [7] | Ibragimov B, Mello-Thoms C. The use of machine learning in eye tracking studies in medical imaging: a review[J]. IEEE J Biomed Health Inform, 2024, 28(6): 3597-612. doi:10.1109/jbhi.2024.3371893 |
| [8] | Karargyris A, Kashyap S, Lourentzou I, et al. Creation and validation of a chest X-ray dataset with eye-tracking and report dictation for AI development[J]. Sci Data, 2021, 8(1): 92. doi:10.1038/s41597-021-00863-5 |
| [9] | Alsharid M, Cai YF, Sharma H, et al. Gaze-assisted automatic captioning of fetal ultrasound videos using three-way multi-modal deep neural networks[J]. Med Image Anal, 2022, 82: 102630. doi:10.1016/j.media.2022.102630 |
| [10] | Zhao ZH, Wang S, Wang Q, et al. Mining gaze for contrastive learning toward computer-assisted diagnosis[J]. Proc AAAI Conf Artif Intell, 2024, 38(7): 7543-51. doi:10.1609/aaai.v38i7.28586 |
| [11] | Kong Y, Wang S, Cai JD, et al. Gaze-DETR: using expert gaze to reduce false positives in vulvovaginal candidiasis screening[C]//Medical Image Computing and Computer Assisted Intervention-MICCAI 2024. Cham: Springer, 2024: 133-43. doi:10.1007/978-3-031-72083-3_13 |
| [12] | Bigolin Lanfredi R, Zhang MY, Auffermann WF, et al. REFLACX, a dataset of reports and eye-tracking data for localization of abnormalities in chest x-rays[J]. Sci Data, 2022, 9(1): 350. doi:10.1038/s41597-022-01441-z |
| [13] | Dosovitskiy A, Beyer L, Kolesnikov A, et al. An image is worth 16×16 words: transformers for image recognition at scale[C/OL]//The Ninth International Conference on Learning Representations (ICLR). 2021. Available from: |
| [14] | Dalal N, Triggs B. Histograms of oriented gradients for human detection[C]//2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05). June 20-25, 2005, San Diego, CA, USA. IEEE, 2005: 886-93. doi:10.1109/cvpr.2005.4 |
| [15] | Chen SH, Qin J, Ji X, et al. Automatic scoring of multiple semantic attributes with multi-task feature leverage: a study on pulmonary nodules in CT images[J]. IEEE Trans Med Imag, 2017, 36(3): 802-14. doi:10.1109/tmi.2016.2629462 |
| [16] | Shin HC, Roth HR, Gao MC, et al. Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning[J]. IEEE Trans Med Imag, 2016, 35(5): 1285-98. doi:10.1109/tmi.2016.2528162 |
| [17] | Ghalati MK, Nunes A, Ferreira H, et al. Texture analysis and its applications in biomedical imaging: a survey[J]. IEEE Rev Biomed Eng, 2022, 15: 222-46. doi:10.1109/rbme.2021.3115703 |
| [18] | Li CY, Wang XY, Eberl S, et al. Supervised variational model with statistical inference and its application in medical image segmentation[J]. IEEE Trans Biomed Eng, 2015, 62(1): 196-207. doi:10.1109/tbme.2014.2344660 |
| [19] | Huang XW, Zhang YL, Meng L, et al. Identification of ultrasonic echolucent carotid plaques using discrete Fréchet distance between bimodal gamma distributions[J]. IEEE Trans Biomed Eng, 2018, 65(5): 949-55. doi:10.1109/tbme.2017.2676129 |
| [20] | Otsu N. A threshold selection method from gray-level histograms[J]. IEEE Trans Syst Man Cybern, 1979, 9(1): 62-6. doi:10.1109/tsmc.1979.4310076 |
| [21] | Johnson AEW, Pollard TJ, Berkowitz SJ, et al. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports[J]. Sci Data, 2019, 6(1): 317. doi:10.1038/s41597-019-0322-0 |
| [22] | Shih G, Wu CC, Halabi SS, et al. Augmenting the national institutes of health chest radiograph dataset with expert annotations of possible pneumonia[J]. Radiol Artif Intell, 2019, 1(1): e180041. doi:10.1148/ryai.2019180041 |
| [23] | Wang XS, Peng YF, Lu L, et al. ChestX-ray: hospital-scale chest X-ray database and benchmarks on weakly supervised classification and localization of common thorax diseases[M]//Deep Learning and Convolutional Neural Networks for Medical Imaging and Clinical Informatics. Cham: Springer International Publishing, 2019: 369-92. doi:10.1007/978-3-030-13969-8_18 |
| [24] | Xue HW, Gao P, Li HY, et al. Stare at what you see: masked image modeling without reconstruction[C]//2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). June 17-24, 2023, Vancouver, BC, Canada. IEEE, 2023: 22732-41. doi:10.1109/cvpr52729.2023.02177 |
| [25] | Ren SC, Wei FY, Zhang SAZ, et al. DeepMIM: deep supervision for masked image modeling[C]//2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). February 26 - March 6, 2025, Tucson, AZ, USA. IEEE, 2025: 879-88. doi:10.1109/wacv61041.2025.00095 |
| [26] | Wei C, Fan HQ, Xie SN, et al. Masked feature prediction for self-supervised visual pre-training[C]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). June 18-24, 2022, New Orleans, LA, USA. IEEE, 2022: 14648-58. doi:10.1109/cvpr52688.2022.01426 |
| [27] | Wu QL, Ye H, Gu Y, et al. Denoising masked autoencoders help robust classification[C/OL]//The Eleventh International Conference on Learning Representations (ICLR). May 1-5, 2023, Kigali, Rwanda. Available from: |
| [28] | Wang HC, Song KY, Fan JS, et al. Hard patches mining for masked image modeling[C]//2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). June 17-24, 2023, Vancouver, BC, Canada. IEEE, 2023: 10375-85. doi:10.1109/cvpr52729.2023.01000 |
| [29] | Selvaraju RR, Cogswell M, Das A, et al. Grad-CAM: visual explanations from deep networks via gradient-based localization[J]. Int J Comput Vis, 2020, 128(2): 336-59. doi:10.1007/s11263-019-01228-7 |
| [30] | Yosinski J, Clune J, Bengio Y, et al. How transferable are features in deep neural networks? [C]//The Twenty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS). December 8-13, 2014, Montreal, QC, Canada. 2014, 27: 3320-28. |
| [31] | Duan JX, Zhang MW, Song MH, et al. Eye tracking-enhanced deep learning for medical image analysis: a systematic review on data efficiency, interpretability, and multimodal integration[J]. Bioengineering, 2025, 12(9): 954. doi:10.3390/bioengineering12090954 |
| [32] | Wang B, Pan HY, Aboah A, et al. GazeGNN: a gaze-guided graph neural network for chest X-ray classification[C]//2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). January 3-8, 2024, Waikoloa, HI, USA. IEEE, 2024: 2183-92. doi:10.1109/wacv57701.2024.00219 |
| [33] | Ma C, Zhao L, Chen YZ, et al. Eye-gaze-guided vision transformer for rectifying shortcut learning[J]. IEEE Trans Med Imag, 2023, 42(11): 3384-94. doi:10.1109/tmi.2023.3287572 |
| [34] | Chen ZH, Liu Z, Song YJ. Gaze-guided vision transformer for chest X-ray image classification[J]. Biomed Signal Process Control, 2026, 111: 108298. doi:10.1016/j.bspc.2025.108298 |
| [35] | Suara S, Jha A, Sinha P, et al. Is Grad-CAM explainable in medical images? [C]//Computer Vision and Image Processing. Cham: Springer, 2024: 124-35. doi:10.1007/978-3-031-58181-6_11 |
| [36] | Santos R, Pedrosa J, Mendonça AM, et al. Grad-CAM: The impact of large receptive fields and other caveats[J]. Comput Vis Image Underst, 2025, 258: 104383. doi:10.1016/j.cviu.2025.104383 |
| [37] | Xiang WZ, Liu C, Yu HY, et al. Wavelet-driven masked image modeling: a path to efficient visual representation[J]. Proc AAAI Conf Artif Intell, 2025, 39(8): 8611-9. doi:10.1609/aaai.v39i8.32930 |
| [38] | Liu C, Cheng YZ, Tamura S. Masked image modeling-based boundary reconstruction for 3D medical image segmentation[J]. Comput Biol Med, 2023, 166: 107526. doi:10.1016/j.compbiomed.2023.107526 |
| [39] | Sultana J, Qin RW, Yin ZZ. Seeing through expert's eyes: leveraging radiologist eye gaze and speech report with graph neural networks for chest X-ray image classification[C]//Computer Vision-ACCV 2024. Singapore: Springer, 2025: 142-58. doi:10.1007/978-981-96-0901-7_9 |
| [40] | Pershin I, Mustafaev T, Ibragimova D, et al. Changes in radiologists' gaze patterns against lung X-rays with different abnormalities: a randomized experiment[J]. J Digit Imaging, 2023, 36(3): 767-75. doi:10.1007/s10278-022-00760-2 |
| [41] | Cho K, Kim KD, Nam Y, et al. CheSS: chest X-ray pre-trained model via self-supervised contrastive learning[J]. J Digit Imaging, 2023, 36(3): 902-10. doi:10.1007/s10278-023-00782-4 |
| [42] | Shurrab S, Guerra-Manzanares A, Shamout FE. Multimodal masked Siamese network improves chest X-ray representation learning[J]. Sci Rep, 2024, 14: 22516. doi:10.1038/s41598-024-74043-x |
| [43] | Kong LJ, Ma MQ, Chen GY, et al. Understanding masked autoencoders via hierarchical latent variable models[C]//2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). June 17-24, 2023, Vancouver, BC, Canada. IEEE, 2023: 7918-28. doi:10.1109/cvpr52729.2023.00765 |
| [44] | Yang QS, Li WY, Li BP, et al. MRM: masked relation modeling for medical image pre-training with genetics[C]//2023 IEEE/CVF International Conference on Computer Vision (ICCV). October 1-6, 2023, Paris, France. IEEE, 2024: 21395-405. doi:10.1109/iccv51070.2023.01961 |
| [45] | Dai PY, Ou YF, Yang YQ, et al. SaSaMIM: synthetic anatomical semantics-aware masked image modeling for colon tumor segmentation in non-contrast abdominal computed tomography[C]//Medical Image Computing and Computer Assisted Intervention-MICCAI 2024. Cham: Springer, 2024: 567-78. doi:10.1007/978-3-031-72120-5_53 |
| [46] | Xie YT, Gu L, Harada T, et al. MedIM: boost medical image representation via radiology report-guided masking[C]//Medical Image Computing and Computer Assisted Intervention-MICCAI 2023. Cham: Springer, 2023: 13-23. doi:10.1007/978-3-031-43907-0_2 |
| [47] | Wang WX, Wang J, Chen C, et al. FreMIM: Fourier transform meets masked image modeling for medical image segmentation[C]//2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). January 3-8, 2024, Waikoloa, HI, USA. IEEE, 2024: 7845-55. doi:10.1109/wacv57701.2024.00768 |
| [48] | Xing ZH, Zhu L, Yu LQ, et al. Hybrid masked image modeling for 3D medical image segmentation[J]. IEEE J Biomed Health Inform, 2024, 28(4): 2115-25. doi:10.1109/jbhi.2024.3360239 |