Journal of Southern Medical University ›› 2026, Vol. 46 ›› Issue (4): 946-955. doi: 10.12122/j.issn.1673-4254.2026.04.23


Visual prior-guided masked image modeling enhances chest X-ray diagnostic efficacy

Jinger WANG, Yu ZHANG

  1. School of Biomedical Engineering, Southern Medical University // Guangdong Provincial Key Laboratory of Medical Image Processing, Guangzhou 510515, China
  • Received: 2025-09-28 Online: 2026-04-20 Published: 2026-04-24
  • Corresponding author: Yu ZHANG, E-mail: wmei2933@gmail.com; yuzhang@smu.edu.cn
  • About the author: Jinger WANG, Master's degree candidate, E-mail: wmei2933@gmail.com
  • Supported by:
    National Natural Science Foundation of China (U22A20350)



Abstract:

Objective To develop a masked image modeling (MIM) framework that integrates clinical visual priors to enhance semantic understanding and diagnostic performance on chest X-ray images. Methods A visual prior-guided masked image reconstruction framework (VP-MIM) was constructed by incorporating clinical visual priors into the MIM process. Eye-tracking data from radiologists were used to distinguish diagnostically relevant from irrelevant regions, enabling an attention-guided masking strategy during the masking phase. In the reconstruction phase, a pyramid attentive reconstruction module was developed to introduce multi-scale supervision, further refined by semantic-aware recalibrated gaze heatmaps to optimize feature learning. Results Experiments on the RSNA Pneumonia and ChestXray-14 public datasets showed that, under linear evaluation with only 2616 pre-training samples, VP-MIM achieved an AUC of 86.83 on the RSNA Pneumonia single-label classification task and a mean AUC (mAUC) of 72.82 on the ChestXray-14 multi-label classification task. Under full fine-tuning, VP-MIM scaled well as the amount of pre-training data increased, reaching an mAUC of 85.49 on ChestXray-14, which confirmed its scalability and strong performance on practical diagnostic tasks. Conclusion VP-MIM alleviates the semantic loss and insufficient multi-scale modeling of MIM methods in medical imaging, improves chest X-ray diagnostic performance, and maintains stable gains as the pre-training data scale grows.
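The attention-guided masking step described in the Methods could be sketched as follows. This is a minimal illustration, not the paper's implementation: the patch size, the 0.75 mask ratio, the median split into relevant/irrelevant patches, and the 50/50 draw between the two groups are all illustrative assumptions.

```python
import numpy as np

def attention_guided_mask(gaze_heatmap, patch_size=16, mask_ratio=0.75, relevant_frac=0.5):
    """Gaze-biased patch masking sketch for MIM pre-training.

    gaze_heatmap: (H, W) array of radiologist gaze density (values >= 0).
    Returns a boolean (num_patches,) array where True means the patch is masked.
    """
    H, W = gaze_heatmap.shape
    gh, gw = H // patch_size, W // patch_size
    # Mean gaze density per patch serves as a proxy for diagnostic relevance.
    cropped = gaze_heatmap[:gh * patch_size, :gw * patch_size]
    saliency = cropped.reshape(gh, patch_size, gw, patch_size).mean(axis=(1, 3)).ravel()

    n = saliency.size
    n_mask = int(round(mask_ratio * n))
    order = np.argsort(saliency)      # low -> high gaze density
    irrelevant = order[:n // 2]       # lower half: background-like patches
    relevant = order[n // 2:]         # upper half: diagnostically relevant patches

    # Draw a fixed fraction of the masked patches from relevant regions so the
    # model must reconstruct clinically meaningful anatomy, and fill the rest
    # from background. (The exact split is an assumption, not the paper's value.)
    n_rel = min(int(round(relevant_frac * n_mask)), relevant.size)
    rng = np.random.default_rng(0)
    chosen = np.concatenate([
        rng.choice(relevant, size=n_rel, replace=False),
        rng.choice(irrelevant, size=n_mask - n_rel, replace=False),
    ])
    mask = np.zeros(n, dtype=bool)
    mask[chosen] = True
    return mask
```

For a 224x224 heatmap with 16-pixel patches this yields a 196-patch grid with 147 masked patches; biasing which patches are masked, rather than masking uniformly at random, is what injects the clinical visual prior into the pre-training objective.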

Key words: masked image modeling, chest X-ray, human visual attention, computer-aided diagnosis
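The multi-scale supervision of the pyramid attentive reconstruction module could likewise be sketched as a gaze-weighted reconstruction loss summed over pyramid levels. The function name, the 2x average-pool pyramid, the number of levels, and the `1 + gaze` weighting are illustrative assumptions, not the paper's formulation.

```python
import numpy as np

def pyramid_gaze_loss(recon, target, gaze, levels=3):
    """Multi-scale reconstruction loss weighted by a gaze heatmap (sketch).

    recon, target: (H, W) float arrays (reconstruction and ground truth).
    gaze: (H, W) float array normalized to [0, 1]; higher = more relevant.
    Returns the summed gaze-weighted MSE over all pyramid levels.
    """
    def pool2x(x):
        # 2x2 average pooling to build the next (coarser) pyramid level.
        H, W = x.shape
        return x[:H - H % 2, :W - W % 2].reshape(H // 2, 2, W // 2, 2).mean(axis=(1, 3))

    total = 0.0
    r, t, g = recon, target, gaze
    for _ in range(levels):
        # Errors inside diagnostically relevant regions (high gaze density)
        # contribute more to the loss than background errors.
        weight = 1.0 + g
        total += float((weight * (r - t) ** 2).mean())
        r, t, g = pool2x(r), pool2x(t), pool2x(g)
    return total
```

Supervising reconstructions at several resolutions in this way is one plausible reading of "multi-scale supervision": coarse levels constrain global anatomy while the finest level penalizes local detail, with the gaze weight concentrating both on clinically meaningful regions.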