Journal of Southern Medical University (南方医科大学学报) ›› 2025, Vol. 45 ›› Issue (6): 1343-1352. doi: 10.12122/j.issn.1673-4254.2025.06.24


CRAKUT: a U-Transformer integrating contrastive regional attention and clinical prior knowledge for radiology report generation

Yedong LIANG1(), Xiongfeng ZHU2, Meiyan HUANG1, Wencong ZHANG1, Hanyu GUO1, Qianjin FENG1()

  1. School of Biomedical Engineering, Southern Medical University // Guangdong Provincial Key Laboratory of Medical Image Processing, Guangzhou 510515, China
    2. School of Biomedical Engineering, Guangdong Medical University, Dongguan 523808, China
  • Received: 2025-01-07 Online: 2025-06-20 Published: 2025-06-27
  • Corresponding author: Qianjin FENG, E-mail: liangyedongsmu@163.com; fengqj99@smu.edu.cn
  • About the author: Yedong LIANG, master's degree candidate, E-mail: liangyedongsmu@163.com
  • Supported by:
    National Natural Science Foundation of China (12126603); Pazhou Lab Research Project (2023K0604)

CRAKUT: integrating contrastive regional attention and clinical prior knowledge in a U-Transformer for radiology report generation

Yedong LIANG1(), Xiongfeng ZHU2, Meiyan HUANG1, Wencong ZHANG1, Hanyu GUO1, Qianjin FENG1()   

  1. School of Biomedical Engineering, Southern Medical University // Guangdong Provincial Key Laboratory of Medical Image Processing, Guangzhou 510515, China
    2. School of Biomedical Engineering, Guangdong Medical University, Dongguan 523808, China
  • Received: 2025-01-07 Online: 2025-06-20 Published: 2025-06-27
  • Contact: Qianjin FENG, E-mail: liangyedongsmu@163.com; fengqj99@smu.edu.cn
  • Supported by:
    National Natural Science Foundation of China (12126603); Pazhou Lab Research Project (2023K0604)

Abstract:

Objective To propose a Contrastive Regional Attention and Prior Knowledge-Infused U-Transformer model (CRAKUT) that addresses imbalanced text distribution, the lack of contextual clinical knowledge, and cross-modal information conversion, thereby improving the quality of generated reports and assisting radiologists in diagnosis. Methods CRAKUT comprises three key modules: a contrastive-attention image encoder that exploits the normal images common in the dataset to extract enhanced visual features; an external knowledge infuser that incorporates clinical prior knowledge; and a U-Transformer that performs vision-to-language cross-modal conversion through a U-shaped connection architecture. The contrastive regional attention introduced in the image encoder strengthens the representation of abnormal regions by emphasizing the differences between normal and abnormal semantic features. In addition, the clinical prior knowledge infuser in the text encoder combines clinical history with a knowledge graph generated by ChatGPT, improving contextual understanding during report generation. The U-Transformer connects the multi-modal encoder and the report decoder, fusing multiple types of information to generate the final report. Results CRAKUT was evaluated on two publicly available CXR datasets (IU-Xray and MIMIC-CXR), and the results show that it achieves state-of-the-art performance on report generation. On the MIMIC-CXR dataset, CRAKUT obtained a BLEU-4 score of 0.159, a ROUGE-L score of 0.353, and a CIDEr score of 0.500; on the IU-Xray dataset, its METEOR score reached 0.258, outperforming previous models. Conclusion The proposed method has great potential for application in clinical disease diagnosis and report generation.

Key words: chest X-ray, contrastive regional attention, clinical prior knowledge, cross-modal interaction, U-Transformer model

Abstract:

Objective We propose a Contrastive Regional Attention and Prior Knowledge-Infused U-Transformer model (CRAKUT) to address the challenges of imbalanced text distribution, lack of contextual clinical knowledge, and cross-modal information transformation, thereby enhancing the quality of generated radiology reports and assisting radiologists in diagnosis. Methods The CRAKUT model comprises three key components: an image encoder that utilizes common normal images from the dataset to extract enhanced visual features, an external knowledge infuser that incorporates clinical prior knowledge, and a U-Transformer that performs cross-modal information conversion from vision to language. The contrastive regional attention introduced in the image encoder enhances the features of abnormal regions by emphasizing the difference between normal and abnormal semantic features. Additionally, the clinical prior knowledge infuser within the text encoder integrates clinical history and knowledge graphs generated by ChatGPT. Finally, the U-Transformer connects the multi-modal encoder and the report decoder in a U-connection schema, fusing multiple types of information to produce the final report. Results We evaluated the proposed CRAKUT model on two publicly available CXR datasets (IU-Xray and MIMIC-CXR). The experimental results showed that CRAKUT achieved state-of-the-art performance on report generation, with a BLEU-4 score of 0.159, a ROUGE-L score of 0.353, and a CIDEr score of 0.500 on the MIMIC-CXR dataset; the model also achieved a METEOR score of 0.258 on the IU-Xray dataset, outperforming all comparison models. Conclusion The proposed method has great potential for application in clinical disease diagnosis and report generation.

Key words: chest X-ray, contrastive regional attention, clinical prior knowledge, cross-modal interaction, U-Transformer model
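The contrastive regional attention described in the Methods can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes single-head dot-product attention without learned projections, and the function and variable names (`contrastive_regional_attention`, `normal_feats`) are illustrative. The idea shown is that each image region attends over a pool of features from normal reference images to build an "expected normal" context, and only the deviation from that context is kept, so normal-looking regions are suppressed and abnormal regions stand out.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def contrastive_regional_attention(img_feats, normal_feats):
    """img_feats: (R, d) regional features of the input image.
    normal_feats: (N, d) features pooled from normal reference images.
    Returns (R, d) contrastive features: each region minus its attended
    'normal context', emphasizing normal-vs-abnormal differences."""
    d = img_feats.shape[1]
    scores = img_feats @ normal_feats.T / np.sqrt(d)   # (R, N) similarities
    attn = softmax(scores, axis=-1)                    # attention over the normal pool
    normal_context = attn @ normal_feats               # (R, d) expected normal appearance
    return img_feats - normal_context                  # contrastive residual
```

A region whose features match the normal pool yields a near-zero residual, while a region that deviates from every normal reference retains a large residual; this is the sense in which the mechanism "enhances the features of abnormal regions".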
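The "U-connection schema" linking the multi-modal encoder and the report decoder can likewise be sketched in NumPy. This is only an illustration under stated assumptions, not the paper's architecture: attention is single-head with no learned weights, layer normalization and feed-forward sublayers are omitted, and the names (`u_transformer`, `skips`) are invented here. The point shown is the U-Net-like pairing: the output of each encoder layer is kept as a skip, and decoder layer i cross-attends to encoder layer (n_layers - 1 - i), so shallow encoder features reach deep decoder layers.

```python
import numpy as np

rng = np.random.default_rng(0)

def attend(queries, keys_values):
    """Plain single-head dot-product cross-attention (no learned projections)."""
    scores = queries @ keys_values.T / np.sqrt(queries.shape[1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ keys_values

def u_transformer(visual_tokens, text_tokens, n_layers=3):
    """visual_tokens: (V, d) encoder input; text_tokens: (T, d) decoder input.
    Returns (T, d) decoder states after U-shaped cross-attention."""
    # Encoder: stack self-attention layers, storing each output as a skip.
    skips = []
    h = visual_tokens
    for _ in range(n_layers):
        h = h + attend(h, h)                      # self-attention + residual
        skips.append(h)
    # Decoder: layer i cross-attends to encoder layer (n_layers - 1 - i),
    # mirroring U-Net skip connections across the encoder-decoder "U".
    g = text_tokens
    for i in range(n_layers):
        g = g + attend(g, g)                      # self-attention + residual
        g = g + attend(g, skips[n_layers - 1 - i])  # U-shaped cross-attention
    return g
```

Compared with a standard Transformer decoder, which attends only to the final encoder output, this wiring lets the decoder draw on multiple levels of visual abstraction when fusing information for the report.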