Journal of Southern Medical University (南方医科大学学报) ›› 2025, Vol. 45 ›› Issue (6): 1343-1352. doi: 10.12122/j.issn.1673-4254.2025.06.24


CRAKUT: a U-Transformer integrating contrastive regional attention and clinical prior knowledge for radiology report generation

Yedong LIANG1(), Xiongfeng ZHU2, Meiyan HUANG1, Wencong ZHANG1, Hanyu GUO1, Qianjin FENG1()

  1. School of Biomedical Engineering, Southern Medical University // Guangdong Provincial Key Laboratory of Medical Image Processing, Guangzhou 510515, China
    2. School of Biomedical Engineering, Guangdong Medical University, Dongguan 523808, China
  • Received: 2025-01-07 Online: 2025-06-20 Published: 2025-06-27
  • Corresponding author: Qianjin FENG, E-mail: liangyedongsmu@163.com; fengqj99@smu.edu.cn
  • About the author: Yedong LIANG, master's degree candidate, E-mail: liangyedongsmu@163.com
  • Supported by:
    National Natural Science Foundation of China (12126603); Pazhou Lab Research Project (2023K0604)

CRAKUT: integrating contrastive regional attention and clinical prior knowledge in a U-Transformer for radiology report generation

Yedong LIANG1(), Xiongfeng ZHU2, Meiyan HUANG1, Wencong ZHANG1, Hanyu GUO1, Qianjin FENG1()   

  1. School of Biomedical Engineering, Southern Medical University // Guangdong Provincial Key Laboratory of Medical Image Processing, Guangzhou 510515, China
    2. School of Biomedical Engineering, Guangdong Medical University, Dongguan 523808, China
  • Received: 2025-01-07 Online: 2025-06-20 Published: 2025-06-27
  • Contact: Qianjin FENG, E-mail: liangyedongsmu@163.com; fengqj99@smu.edu.cn
  • Supported by:
    National Natural Science Foundation of China (12126603); Pazhou Lab Research Project (2023K0604)

Abstract:

Objective To propose a Contrastive Regional Attention and Prior Knowledge-Infused U-Transformer model (CRAKUT) that addresses imbalanced text distribution, the lack of contextual clinical knowledge, and cross-modal information conversion, thereby improving the quality of generated reports and assisting radiologists in diagnosis. Methods CRAKUT comprises three key modules: a contrastive-attention image encoder that exploits the normal images common in the dataset to extract enhanced visual features; an external knowledge infuser that incorporates clinical prior knowledge; and a U-Transformer that performs vision-to-language cross-modal conversion through a U-shaped connection architecture. The contrastive regional attention introduced in the image encoder strengthens the representation of abnormal regions by emphasizing the differences between normal and abnormal semantic features. In addition, the clinical prior knowledge infuser in the text encoder combines clinical history with a knowledge graph generated by ChatGPT, improving contextual understanding during report generation. The U-Transformer connects the multi-modal encoder and the report decoder, fusing multiple types of information to generate the final report. Results CRAKUT was evaluated on two publicly available CXR datasets (IU-Xray and MIMIC-CXR), and the results show that it achieves state-of-the-art performance on report generation. On the MIMIC-CXR dataset, CRAKUT obtained a BLEU-4 score of 0.159, a ROUGE-L score of 0.353, and a CIDEr score of 0.500; on the IU-Xray dataset, its METEOR score reached 0.258, outperforming previous models. Conclusion The proposed method has great potential for application in clinical disease diagnosis and report generation.

Key words: chest X-ray, contrastive regional attention, clinical prior knowledge, cross-modal interaction, U-Transformer model

Abstract:

Objective We propose a Contrastive Regional Attention and Prior Knowledge-Infused U-Transformer model (CRAKUT) to address the challenges of imbalanced text distribution, lack of contextual clinical knowledge, and cross-modal information transformation, thereby enhancing the quality of generated radiology reports and assisting radiologists in diagnosis. Methods The CRAKUT model comprises three key components: an image encoder that utilizes common normal images from the dataset to extract enhanced visual features, an external knowledge infuser that incorporates clinical prior knowledge, and a U-Transformer that performs cross-modal information conversion from vision to language. The contrastive regional attention introduced in the image encoder enhances the features of abnormal regions by emphasizing the difference between normal and abnormal semantic features. Additionally, the clinical prior knowledge infuser within the text encoder integrates clinical history and knowledge graphs generated by ChatGPT. Finally, the U-Transformer connects the multi-modal encoder and the report decoder in a U-connection schema, fusing multiple types of information to produce the final report. Results We evaluated the proposed CRAKUT model on two publicly available CXR datasets (IU-Xray and MIMIC-CXR). The experimental results showed that CRAKUT achieved state-of-the-art performance on report generation, with a BLEU-4 score of 0.159, a ROUGE-L score of 0.353, and a CIDEr score of 0.500 on the MIMIC-CXR dataset; the model also achieved a METEOR score of 0.258 on the IU-Xray dataset, outperforming all comparison models. Conclusion The proposed method has great potential for application in clinical disease diagnosis and report generation.

Key words: chest X-ray, contrastive regional attention, clinical prior knowledge, cross-modal interaction, U-Transformer model
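The contrastive regional attention described in the Methods can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes single-head dot-product attention without learned projections, and the function and variable names (`contrastive_regional_attention`, `normal_feats`) are illustrative. The idea shown is that each image region attends over a pool of features from normal reference images to build an "expected normal" context, and only the deviation from that context is kept, so normal-looking regions are suppressed and abnormal regions stand out.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def contrastive_regional_attention(img_feats, normal_feats):
    """img_feats: (R, d) regional features of the input image.
    normal_feats: (N, d) features pooled from normal reference images.
    Returns (R, d) contrastive features: each region minus its attended
    'normal context', emphasizing normal-vs-abnormal differences."""
    d = img_feats.shape[1]
    scores = img_feats @ normal_feats.T / np.sqrt(d)   # (R, N) similarities
    attn = softmax(scores, axis=-1)                    # attention over the normal pool
    normal_context = attn @ normal_feats               # (R, d) expected normal appearance
    return img_feats - normal_context                  # contrastive residual
```

A region whose features match the normal pool yields a near-zero residual, while a region that deviates from every normal reference retains a large residual; this is the sense in which the mechanism "enhances the features of abnormal regions".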
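The "U-connection schema" linking the multi-modal encoder and the report decoder can likewise be sketched in NumPy. This is only an illustration under stated assumptions, not the paper's architecture: attention is single-head with no learned weights, layer normalization and feed-forward sublayers are omitted, and the names (`u_transformer`, `skips`) are invented here. The point shown is the U-Net-like pairing: the output of each encoder layer is kept as a skip, and decoder layer i cross-attends to encoder layer (n_layers - 1 - i), so shallow encoder features reach deep decoder layers.

```python
import numpy as np

rng = np.random.default_rng(0)

def attend(queries, keys_values):
    """Plain single-head dot-product cross-attention (no learned projections)."""
    scores = queries @ keys_values.T / np.sqrt(queries.shape[1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ keys_values

def u_transformer(visual_tokens, text_tokens, n_layers=3):
    """visual_tokens: (V, d) encoder input; text_tokens: (T, d) decoder input.
    Returns (T, d) decoder states after U-shaped cross-attention."""
    # Encoder: stack self-attention layers, storing each output as a skip.
    skips = []
    h = visual_tokens
    for _ in range(n_layers):
        h = h + attend(h, h)                      # self-attention + residual
        skips.append(h)
    # Decoder: layer i cross-attends to encoder layer (n_layers - 1 - i),
    # mirroring U-Net skip connections across the encoder-decoder "U".
    g = text_tokens
    for i in range(n_layers):
        g = g + attend(g, g)                      # self-attention + residual
        g = g + attend(g, skips[n_layers - 1 - i])  # U-shaped cross-attention
    return g
```

Compared with a standard Transformer decoder, which attends only to the final encoder output, this wiring lets the decoder draw on multiple levels of visual abstraction when fusing information for the report.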