Journal of Southern Medical University ›› 2025, Vol. 45 ›› Issue (2): 409-421. doi: 10.12122/j.issn.1673-4254.2025.02.22
Yuying REN, Lingxiao HUANG, Fang DU, Xinbo YAO
Received: 2024-10-30
Online: 2025-02-20
Published: 2025-03-03
Contact: Lingxiao HUANG
E-mail: ran96822@stu.nxu.edu.cn; huanglx@nxu.edu.cn
Yuying REN, Lingxiao HUANG, Fang DU, Xinbo YAO. An efficient and lightweight skin pathology detection method based on multi-scale feature fusion using an improved RT-DETR model[J]. Journal of Southern Medical University, 2025, 45(2): 409-421.
URL: https://www.j-smu.com/EN/10.12122/j.issn.1673-4254.2025.02.22
Label name | Number of annotations | Number of images |
---|---|---|
MEL | 1123 | 1113 |
NV | 6761 | 6705 |
BCC | 517 | 514 |
AKIEC | 334 | 327 |
BKL | 1124 | 1099 |
DF | 116 | 115 |
VASC | 142 | 142 |
Tab.1 Dataset statistical information
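The annotation count in Tab.1 exceeds the image count for most classes because a single dermoscopic image can contain more than one labeled lesion. Summing both columns is a quick consistency check; the image total matches the 10 015 images of the HAM10000 dataset (ref. 20). A minimal sketch in Python:

```python
# Per-class (annotations, images) pairs taken from Tab.1.
ham10000 = {
    "MEL": (1123, 1113), "NV": (6761, 6705), "BCC": (517, 514),
    "AKIEC": (334, 327), "BKL": (1124, 1099), "DF": (116, 115),
    "VASC": (142, 142),
}
total_annotations = sum(a for a, _ in ham10000.values())
total_images = sum(i for _, i in ham10000.values())
print(total_annotations, total_images)  # 10117 10015
```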
Layer number and name | Output size | Layer structure |
---|---|---|
0-Embedding | 160×160 | Conv 4×4, 40 |
1-Stage 1 | 160×160 | FasterNetRepBlock, 40 |
2-Merging | 80×80 | Conv 2×2, 80 |
3-Stage 2 | 80×80 | FasterNetRepBlock(Repeat 2 times), 80 |
4-Merging | 40×40 | Conv 2×2, 160 |
5-Stage 3 | 40×40 | FasterNetRepBlock(Repeat 8 times), 160 |
6-Merging | 20×20 | Conv 2×2, 320 |
7-Stage 4 | 20×20 | FasterNetRepBlock(Repeat 2 times), 320 |
Tab.2 Main structure of the FasterNet backbone network
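The stage layout in Tab.2 can be sanity-checked by tracing shapes: the 4×4 stride-4 embedding yields 160×160 feature maps, and each 2×2 stride-2 merging layer halves the resolution while doubling the channels (40→80→160→320). A small sketch (the 640×640 input size is an assumption inferred from the table's output sizes):

```python
def fasternet_shapes(input_size=640, embed_dim=40, n_stages=4):
    """Trace (spatial size, channels) through the Tab.2 stage layout:
    a 4x4 stride-4 embedding, then a stride-2 merging before stages 2-4."""
    size, ch = input_size // 4, embed_dim   # 0-Embedding: Conv 4x4, stride 4
    shapes = [(size, ch)]                   # 1-Stage 1 keeps this shape
    for _ in range(n_stages - 1):           # Merging: Conv 2x2, stride 2
        size, ch = size // 2, ch * 2
        shapes.append((size, ch))
    return shapes

print(fasternet_shapes())  # [(160, 40), (80, 80), (40, 160), (20, 320)]
```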
Configuration item | Version / specification |
---|---|
Operating system | Ubuntu 20.04.6 |
CPU | Intel(R) Xeon(R) Gold 5418Y |
GPU | NVIDIA GeForce RTX 4090 24GB |
CUDA | CUDA 11.3 |
RAM | 32 GB |
Deep learning framework | PyTorch 1.13.0 |
Tab.3 Experimental environment configuration
Experiment number | Reparameterized FasterNet | AIFI-CAFM | DRB-HSFPN | mAP50 (%) | Params (M) | FLOPs (G) |
---|---|---|---|---|---|---|
1 | | | | 49.3 | 20.2 | 58.8 |
2 | √ | | | 50.7 | 11.1 | 29.9 |
3 | | √ | | 51.6 | 20.9 | 59.7 |
4 | | | √ | 49.6 | 17.3 | 45.6 |
5 | √ | √ | | 52.7 | 13.7 | 32.3 |
6 | √ | √ | √ | 53.8 | 10.9 | 19.3 |
Tab.4 Ablation experiment results
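Reading Tab.4, the full model (experiment 6) improves mAP50 by 4.5 points over the RT-DETR baseline (experiment 1) while using roughly 46% fewer parameters and 67% fewer FLOPs. The deltas can be reproduced directly from the table:

```python
base = {"mAP50": 49.3, "params_M": 20.2, "flops_G": 58.8}  # Tab.4, exp. 1
full = {"mAP50": 53.8, "params_M": 10.9, "flops_G": 19.3}  # Tab.4, exp. 6

map_gain = round(full["mAP50"] - base["mAP50"], 1)
param_cut = round(100 * (1 - full["params_M"] / base["params_M"]), 1)
flops_cut = round(100 * (1 - full["flops_G"] / base["flops_G"]), 1)
print(map_gain, param_cut, flops_cut)  # 4.5 46.0 67.2
```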
Loss | P | R | mAP50 | mAP50:95 |
---|---|---|---|---|
GIoU | 69.7 | 53.2 | 52.8 | 42.5 |
CIoU | 62.1 | 51.9 | 52.4 | 42.2 |
DIoU | 68.7 | 54.5 | 52.7 | 42.7 |
SIoU | 69.5 | 54.2 | 53.1 | 43.0 |
EIoU | 66.7 | 57.3 | 53.3 | 43.1 |
Inner-EIoU | 71.9 | 55.2 | 53.8 | 43.3 |
Tab.5 Comparison of experimental results by introducing different loss functions (%)
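The losses in Tab.5 all extend the basic IoU overlap term. As one illustration, EIoU (ref. 19) adds a normalized center-distance penalty plus separate width and height penalties; Inner-EIoU (ref. 18) further computes the overlap on auxiliary boxes scaled by a ratio factor. Below is a minimal sketch of EIoU for axis-aligned (x1, y1, x2, y2) boxes — an illustrative re-derivation, not the authors' exact implementation:

```python
def eiou_loss(box_a, box_b):
    """EIoU loss for (x1, y1, x2, y2) boxes: 1 - IoU plus normalized
    center-distance and width/height-difference penalties (ref. 19)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection area and union area
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union
    # Width/height of the smallest enclosing box
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    # Squared center distance (as in DIoU)
    d2 = ((ax1 + ax2) / 2 - (bx1 + bx2) / 2) ** 2 + \
         ((ay1 + ay2) / 2 - (by1 + by2) / 2) ** 2
    # Width/height penalties specific to EIoU
    dw2 = ((ax2 - ax1) - (bx2 - bx1)) ** 2
    dh2 = ((ay2 - ay1) - (by2 - by1)) ** 2
    return 1 - iou + d2 / (cw ** 2 + ch ** 2) + dw2 / cw ** 2 + dh2 / ch ** 2

print(eiou_loss((0, 0, 2, 2), (0, 0, 2, 2)))  # 0.0 for identical boxes
```

For identical boxes every penalty term vanishes, so the loss is exactly 0; for partially overlapping boxes the IoU term and the center-distance term both contribute.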
Category | P (before) | P (after) | R (before) | R (after) | mAP50 (before) | mAP50 (after) |
---|---|---|---|---|---|---|
All | 64.6 | 71.9 | 53.4 | 55.2 | 49.3 | 53.8 |
MEL | 70.0 | 70.9 | 35.3 | 41.8 | 34.6 | 47.3 |
NV | 79.2 | 79.6 | 92.6 | 85.6 | 86.8 | 83.2 |
BCC | 52.6 | 68.1 | 58.5 | 58.8 | 42.3 | 51.8 |
AKIEC | 61.4 | 65.9 | 39.9 | 50.0 | 40.5 | 35.6 |
BKL | 41.5 | 67.3 | 45.9 | 50.2 | 31.6 | 47.4 |
DF | 59.9 | 75.5 | 59.6 | 57.1 | 54.7 | 64.1 |
VASC | 67.3 | 75.9 | 40.7 | 42.9 | 43.6 | 47.0 |
Tab.6 Comparison of model performance across various categories on HAM10000 before and after improvements
Model | Backbone | Params (M) | FLOPs (G) | mAP50 | mAP50:95 | FPS |
---|---|---|---|---|---|---|
Faster-RCNN[24] | R50 | 137.1 | 370.2 | 39.3 | 25.5 | 26.6 |
YOLOv7[25] | - | 36.5 | 104.7 | 44.3 | 33.9 | 53.7 |
YOLOv7-X | - | 70.8 | 188.1 | 48.6 | 37.5 | 22.3 |
YOLOv8-S[26] | - | 11.2 | 28.6 | 47.3 | 38.3 | 61.3 |
YOLOv8-M | - | 26.9 | 79.1 | 48.2 | 39.2 | 46.2 |
YOLOv8-L | - | 43.7 | 165.1 | 49.6 | 41.1 | 31.4 |
YOLOv9-S[27] | - | 7.1 | 26.2 | 46.3 | 37.4 | 28.0 |
YOLOv9-M | - | 20.1 | 76.9 | 50.1 | 37.3 | 38.1 |
GOLD-YOLO-S[28] | - | 21.3 | 46.1 | 44.1 | 34.4 | 55.3 |
GOLD-YOLO-M | - | 41.2 | 87.3 | 46.2 | 35.9 | 37.4 |
Deformable-DETR[29] | R50 | 39.8 | 172.9 | 45.0 | 34.6 | - |
DINO[30] | R50 | 47.2 | 279.0 | 44.1 | 34.4 | 6.4 |
DAB-DETR[31] | R50 | 35.2 | 210.0 | 46.9 | 37.9 | - |
Conditional-DETR[32] | R50 | 44.0 | 86.3 | 45.3 | 33.9 | - |
RT-DETR | R18 | 20.2 | 58.8 | 49.3 | 40.5 | 40.2 |
RT-DETR | R34 | 31.4 | 88.6 | 50.1 | 41.9 | 33.1 |
RT-DETR | R50 | 40.3 | 134.8 | 50.8 | 42.8 | 29.8 |
SD-DETR | FasterNet | 10.9 | 19.3 | 53.8 | 43.3 | 59.1 |
Tab.7 Performance comparison of different models
1 | Sung H, Ferlay J, Siegel RL, et al. Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries[J]. CA Cancer J Clin, 2021, 71(3): 209-49. |
2 | Fu XF, Wang MY, Chen XX. Application of dermoscopy in screening for facial skin tumors[J]. China Modern Doctor, 2018, 56(21): 86-88, 92. |
3 | Li XY, Wang LX, Zhang L, et al. Application of multimodal and molecular imaging techniques in the detection of choroidal melanomas[J]. Front Oncol, 2021, 10: 617868. |
4 | Argenziano G, Catricalà C, Ardigo M, et al. Seven-point checklist of dermoscopy revisited[J]. Br J Dermatol, 2011, 164(4): 785-90. |
5 | Ganster H, Pinz A, Röhrer R, et al. Automated melanoma recognition[J]. IEEE Trans Med Imaging, 2001, 20(3): 233-9. |
6 | Rana M, Bhushan M. Machine learning and deep learning approach for medical image analysis: diagnosis to detection[J]. Multimed Tools Appl, 2022: 1-39. |
7 | Shao H, Zhang MK, Cui WC. Dermoscopic image classification method based on hierarchical convolutional neural networks[J]. Chinese Journal of Intelligent Science and Technology, 2021, 3(4): 474-481. |
8 | Zheng SY, Hu LX, Lv XQ, et al. Edge-guided self-correcting skin detection[J]. Computer Science, 2022, 49(11): 141-147. |
9 | Huang HY, Hsiao YP, Mukundan A, et al. Classification of skin cancer using novel hyperspectral imaging engineering via YOLOv5[J]. J Clin Med, 2023, 12(3): 1134. |
10 | Shen X, Wei LS. Dermoscopic image segmentation method based on attention residual U-Net[J]. CAAI Transactions on Intelligent Systems, 2023, 18(4): 699-707. |
11 | Wang YF, Cheng HY, Wan CB, et al. A skin cancer detection framework based on a dual-branch attention neural network[J]. Chinese Journal of Biomedical Engineering, 2024, 43(2): 153-161. |
12 | Gao G, Xiao FL, Yang F. Diagnosis of hypopigmented dermatoses based on improved MobileNetV3-Small[J]. Computer and Modernization, 2024(5): 120-126. |
13 | Zhao YA, Lv WY, Xu SL, et al. DETRs beat YOLOs on real-time object detection[C]//2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). June 16-22, 2024, Seattle, WA, USA. IEEE, 2024: 16965-74. |
14 | Li D, Han T, Zhou HT, et al. Lightweight Siamese network for visual tracking via FasterNet and feature adaptive fusion[C]//2024 5th International Seminar on Artificial Intelligence, Networking and Information Technology (AINIT). March 29-31, 2024, Nanjing, China. IEEE, 2024: 1-5. |
15 | Hu S, Gao F, Zhou XW, et al. Hybrid convolutional and attention network for hyperspectral image denoising[J]. IEEE Geosci Remote Sens Lett, 2024, 21: 5504005. |
16 | Ding XH, Zhang YY, Ge YX, et al. UniRepLKNet: a universal perception large-kernel ConvNet for audio, video, point cloud, time-series and image recognition[C]//2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). June 16-22, 2024, Seattle, WA, USA. IEEE, 2024: 5513-24. |
17 | Chen YF, Zhang CY, Chen B, et al. Accurate leukocyte detection based on deformable-DETR and multi-level feature fusion for aiding diagnosis of blood diseases[J]. Comput Biol Med, 2024, 170: 107917. |
18 | Zhang H, Xu C, Zhang SJ. Inner-IoU: more effective intersection over union loss with auxiliary bounding box[EB/OL]. arXiv preprint arXiv:2311.02877, 2023. |
19 | Zhang YF, Ren WQ, Zhang Z, et al. Focal and efficient IOU loss for accurate bounding box regression[J]. Neurocomputing, 2022, 506: 146-57. |
20 | Tschandl P, Rosendahl C, Kittler H. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions[J]. Sci Data, 2018, 5: 180161. |
21 | Ding XH, Zhang XY, Ma NN, et al. RepVGG: making VGG-style ConvNets great again[C]//2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). June 20-25, 2021, Nashville, TN, USA. IEEE, 2021: 13728-37. |
22 | Zheng ZH, Wang P, Liu W, et al. Distance-IoU loss: faster and better learning for bounding box regression[J]. Proc AAAI Conf Artif Intell, 2020, 34(7): 12993-3000. |
23 | Gevorgyan Z. SIoU loss: more powerful learning for bounding box regression[EB/OL]. arXiv preprint arXiv:2205.12740, 2022. |
24 | Ren SQ, He KM, Girshick R, et al. Faster R-CNN: towards real-time object detection with region proposal networks[J]. IEEE Trans Pattern Anal Mach Intell, 2017, 39(6): 1137-49. |
25 | Wang CY, Bochkovskiy A, Liao HM. YOLOv7: trainable bag-of-freebies sets new state-of-the-art for real-time object detectors[C]//2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). June 17-24, 2023, Vancouver, BC, Canada. IEEE, 2023: 7464-75. |
26 | Reis D, Kupec J, Hong J, et al. Real-time flying object detection with YOLOv8[EB/OL]. arXiv preprint arXiv:2305.09972, 2023. |
27 | Wang CY, Yeh IH, Mark Liao HY. YOLOv9: learning what you want to learn using programmable gradient information[M]//Computer Vision – ECCV 2024. Cham: Springer Nature Switzerland, 2024: 1-21. |
28 | Wang CC, He W, Nie Y, et al. Gold-YOLO: efficient object detector via gather-and-distribute mechanism[EB/OL]. arXiv preprint arXiv:2309.11331, 2023. |
29 | Zhu XZ, Su WJ, Lu LW, et al. Deformable DETR: deformable transformers for end-to-end object detection[EB/OL]. arXiv preprint arXiv:2010.04159, 2020. |
30 | Zhang H, Li F, Liu SL, et al. DINO: DETR with improved DeNoising anchor boxes for end-to-end object detection[EB/OL]. arXiv preprint arXiv:2203.03605, 2022. |
31 | Liu SL, Li F, Zhang H, et al. DAB-DETR: dynamic anchor boxes are better queries for DETR[EB/OL]. arXiv preprint arXiv:2201.12329, 2022. |
32 | Meng DP, Chen XK, Fan ZJ, et al. Conditional DETR for fast training convergence[C]//2021 IEEE/CVF International Conference on Computer Vision (ICCV). October 10-17, 2021, Montreal, QC, Canada. IEEE, 2021: 3631-40. |
33 | Selvaraju RR, Cogswell M, Das A, et al. Grad-CAM: visual explanations from deep networks via gradient-based localization[J]. Int J Comput Vis, 2020, 128(2): 336-59. |
34 | He KM, Zhang XY, Ren SQ, et al. Deep residual learning for image recognition[C]//2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). June 27-30, 2016, Las Vegas, NV, USA. IEEE, 2016: 770-8. |