Journal of Southern Medical University ›› 2026, Vol. 46 ›› Issue (2): 466-472. doi: 10.12122/j.issn.1673-4254.2026.02.24
Received: 2025-07-22
Online: 2026-02-20
Published: 2026-03-10
Corresponding author: Shengli AN, E-mail: kk20001205@163.com; 1069766473@qq.com
About the author: Keke LI, master's student, E-mail: kk20001205@163.com
Keke LI, Lishan XU, Milai YU, Shengli AN
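Abstract:
Objective To investigate, through a simulation study, how different methods of handling missing values affect estimation of the AC1 coefficient (first-order agreement coefficient), and to provide a reference for practical applications.
Methods Monte Carlo simulation was used to generate nominal (unordered) rating data under different missingness mechanisms. The simulation parameters were the number of raters, number of categories, sample size, disease prevalence, random rating probability, and proportion of missing data. Four missing-data handling methods were compared: excluding subjects with zero ratings (partial deletion), excluding subjects with incomplete ratings (listwise deletion), imputation by rater mode, and imputation by subject mode. Bias and mean squared error (MSE) served as the evaluation criteria.
Results When disease prevalence was balanced, or when data were missing completely at random (MCAR) or missing at random (MAR), partial deletion performed best, with bias and MSE close to 0 when the missing proportion was below 30%. When prevalence was skewed and data were missing not at random (MNAR), subject mode imputation performed better: its bias stayed within ±0.10 and its MSE below 0.09, and the MSE was almost 0 when the sample size was adequate and the missing proportion did not exceed 30%. Rater mode imputation performed worst in all scenarios. Listwise deletion gave small errors only with 2 raters and 2 categories, a low missing proportion, and MCAR or MAR data, and was unstable in other scenarios.
Conclusion No single missing-data handling method is universally optimal. Partial deletion is recommended when prevalence is balanced or when the missingness mechanism can be assumed to be MCAR or MAR; subject mode imputation is recommended when prevalence is skewed and MNAR is suspected. Researchers are also advised to report AC1 estimates under several methods to assess the sensitivity of the results.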
Keke LI, Lishan XU, Milai YU, Shengli AN. Comparison of missing data handling methods for AC1 coefficient estimation[J]. Journal of Southern Medical University, 2026, 46(2): 466-472.
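The four handling methods compared in the abstract can be illustrated with a minimal Python sketch. It assumes ratings are stored as a subjects-by-raters NumPy array with `np.nan` for a missing rating. The function names, the tie-breaking rule for the mode (smallest category wins), and dropping never-rated subjects after subject mode imputation are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def _mode(values):
    """Most frequent value; ties go to the smallest category (an assumption)."""
    cats, counts = np.unique(values, return_counts=True)
    return cats[np.argmax(counts)]

def partial_deletion(ratings):
    """Exclude subjects (rows) that received no rating at all."""
    return ratings[~np.isnan(ratings).all(axis=1)]

def listwise_deletion(ratings):
    """Exclude subjects (rows) with at least one missing rating."""
    return ratings[~np.isnan(ratings).any(axis=1)]

def rater_mode_imputation(ratings):
    """Fill each rater's (column's) gaps with that rater's most frequent rating."""
    out = ratings.copy()
    for j in range(out.shape[1]):
        col = out[:, j]                      # view into `out`
        observed = col[~np.isnan(col)]
        if observed.size:
            col[np.isnan(col)] = _mode(observed)
    return out

def subject_mode_imputation(ratings):
    """Fill each subject's (row's) gaps with the mode of the ratings it received."""
    out = ratings.copy()
    for i in range(out.shape[0]):
        row = out[i]                         # view into `out`
        observed = row[~np.isnan(row)]
        if observed.size:
            row[np.isnan(row)] = _mode(observed)
    # Subjects with zero ratings have no mode; they are dropped here (an assumption).
    return out[~np.isnan(out).all(axis=1)]

demo = np.array([[1, 1, np.nan],
                 [2, np.nan, 2],
                 [np.nan, np.nan, np.nan],
                 [1, 2, 1]], dtype=float)
```

On `demo`, partial deletion keeps three subjects and listwise deletion keeps only the last one, which shows how quickly listwise deletion can shrink a sample.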
| Rater A | | Rater B: Random ratings | | | Rater B: Certain ratings | | |
|---|---|---|---|---|---|---|---|
| | | 1 | ... | k | 1 | ... | k |
| Random ratings | 1 | $n_{11}^{RR}$ | ... | $n_{1k}^{RR}$ | $n_{11}^{RC}$ | ... | $n_{1k}^{RC}$ |
| | ... | ... | $n_{jj'}^{RR}$ | ... | ... | $n_{jj'}^{RC}$ | ... |
| | k | $n_{k1}^{RR}$ | ... | $n_{kk}^{RR}$ | $n_{k1}^{RC}$ | ... | $n_{kk}^{RC}$ |
| Certain ratings | 1 | $n_{11}^{CR}$ | ... | $n_{1k}^{CR}$ | $n_{11}^{CC}$ | ... | 0 |
| | ... | ... | $n_{jj'}^{CR}$ | ... | ... | $n_{jj'}^{CC}$ | ... |
| | k | $n_{k1}^{CR}$ | ... | $n_{kk}^{CR}$ | 0 | ... | $n_{kk}^{CC}$ |

Tab.1 Contingency table for random and certain ratings
| Scenario | | |
|---|---|---|
| Two Raters (r=2) | | |
| Multiple Raters (r≥3) | | |

Tab.2 AC1 coefficient formulas
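The formula cells of Tab.2 are not reproduced on this page. The sketch below therefore uses the usual definition of Gwet's AC1 from the inter-rater reliability literature, which is assumed, not confirmed, to match the paper's table: observed agreement p_a is averaged over subjects with at least two ratings, category propensities π_k over subjects with at least one rating, chance agreement is p_e = Σ π_k(1 − π_k) / (K − 1), and AC1 = (p_a − p_e) / (1 − p_e). The function name `gwet_ac1` and the 0..K-1 category coding are illustrative choices.

```python
import numpy as np

def gwet_ac1(ratings, n_categories):
    """AC1 for an (n_subjects, n_raters) array; categories 0..K-1, NaN = missing."""
    ratings = np.asarray(ratings, dtype=float)
    # counts[i, k]: number of raters who assigned subject i to category k
    counts = np.stack([(ratings == k).sum(axis=1) for k in range(n_categories)], axis=1)
    r_i = counts.sum(axis=1)          # ratings received by each subject
    rated = r_i >= 1                  # contributes to category propensities
    paired = r_i >= 2                 # contributes to observed agreement
    if not paired.any():
        raise ValueError("need at least one subject rated by two or more raters")

    c, r = counts[paired], r_i[paired]
    # Observed agreement: share of agreeing rater pairs, averaged over subjects
    p_a = np.mean((c * (c - 1)).sum(axis=1) / (r * (r - 1)))

    # Category propensities and chance agreement
    pi_k = (counts[rated] / r_i[rated][:, None]).mean(axis=0)
    p_e = (pi_k * (1 - pi_k)).sum() / (n_categories - 1)
    return (p_a - p_e) / (1 - p_e)

# Two raters, two categories: 8 agree on 0, 1 agrees on 1, 1 disagrees
demo = np.array([[0, 0]] * 8 + [[1, 1]] + [[0, 1]], dtype=float)
ac1 = gwet_ac1(demo, n_categories=2)
```

For `demo`, p_a = 0.9 and p_e = 2 × 0.85 × 0.15 = 0.255, so AC1 = 0.645 / 0.745. With two raters and complete data the same function reduces to the two-rater case, so one implementation covers both rows of the table under these assumptions.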
| Rater A | Rater B | | | | Total |
|---|---|---|---|---|---|
| | 1 | ... | k | Missing | |
| 1 | $n_{11}$ | ... | $n_{1k}$ | $n_{1m}$ | $n_{1+}$ |
| ... | ... | $n_{jj'}$ | ... | ... | ... |
| k | $n_{k1}$ | ... | $n_{kk}$ | $n_{km}$ | $n_{k+}$ |
| Missing | $n_{m1}$ | ... | $n_{mk}$ | $n_{mm}$ | $n_{m+}$ |
| Total | $n_{+1}$ | ... | $n_{+k}$ | $n_{+m}$ | $n$ |

Tab.3 Contingency table for response category with missing ratings
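A table of this shape can be tabulated from two rating vectors by treating "Missing" as an extra category. The following is a minimal sketch; the function name `contingency_with_missing`, the 0..K-1 coding, and the use of `np.nan` for a missing rating are assumptions made for illustration.

```python
import numpy as np

def contingency_with_missing(rater_a, rater_b, n_categories):
    """(K+1) x (K+1) count table; the last row/column is the Missing category."""
    a = np.asarray(rater_a, dtype=float)
    b = np.asarray(rater_b, dtype=float)
    # Map a missing rating to the extra index K
    ia = np.where(np.isnan(a), n_categories, a).astype(int)
    ib = np.where(np.isnan(b), n_categories, b).astype(int)
    table = np.zeros((n_categories + 1, n_categories + 1), dtype=int)
    np.add.at(table, (ia, ib), 1)     # unbuffered add, so repeated cells accumulate
    return table

rater_a = [0, 0, 1, np.nan, 1, np.nan]
rater_b = [0, 1, 1, 0, np.nan, np.nan]
table = contingency_with_missing(rater_a, rater_b, n_categories=2)
row_totals = table.sum(axis=1)        # n_{j+}, including n_{m+}
col_totals = table.sum(axis=0)        # n_{+j'}, including n_{+m}
```

Here `table[-1, -1]` corresponds to $n_{mm}$, the subjects rated by neither rater, which is the cell that partial deletion removes.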
| Parameters | 2 Raters+2 Categories | 8 Raters+4 Categories |
|---|---|---|
| n (sample size) | 25, 50, 100, 200 | 25, 50, 100, 200 |
| Pr (disease prevalence) | Skewed: (0.90, 0.10); Balanced: (0.50, 0.50) | Skewed: (0.70, 0.15, 0.10, 0.05); Balanced: (0.25, 0.25, 0.25, 0.25) |
| $r_q$ (random rating probability of rater q) | Low: $r_1 = 0.05$, $r_2 = 0.05$; High: $r_1 = 0.20$, $r_2 = 0.20$; Mixed: $r_1 = 0.05$, $r_2 = 0.20$ | Low: $r_q = 0.05$, q = 1, …, 8; High: $r_q = 0.20$, q = 1, …, 8; Mixed: $r_q = 0.05$, $r_{q+1} = 0.20$, q = 1, 3, 5, 7 |
| M (missing proportion) | 5%, 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50% | 5%, 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50% |

Tab.4 Setting of the simulation parameters
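The parameter grid for the 2-rater, 2-category column can be enumerated and used to drive a data generator, as in the hedged sketch below. The paper's generator is not shown on this page, so the model here is an assumption in the spirit of Tab.1: each rater reports the subject's true category unless, with probability $r_q$, the rating is drawn uniformly at random. Only MCAR missingness is illustrated, M is taken as the share of missing cells, and all names (`simulate_ratings`, `add_mcar`, the dictionary keys) are hypothetical.

```python
import itertools
import numpy as np

SAMPLE_SIZES = [25, 50, 100, 200]
MISSING_PROPORTIONS = [m / 100 for m in range(5, 55, 5)]       # 5% ... 50%
PREVALENCE = {"skewed": [0.90, 0.10], "balanced": [0.50, 0.50]}
RANDOM_RATING = {"low": [0.05, 0.05], "high": [0.20, 0.20], "mixed": [0.05, 0.20]}

def simulate_ratings(n, prevalence, random_prob, rng):
    """n x r matrix; a rater gives the true category unless rating at random."""
    k, r = len(prevalence), len(random_prob)
    truth = rng.choice(k, size=n, p=prevalence)
    ratings = np.tile(truth[:, None], (1, r)).astype(float)
    is_random = rng.random((n, r)) < np.asarray(random_prob)
    # Random ratings are uniform over the k categories (an assumption)
    ratings[is_random] = rng.integers(0, k, size=is_random.sum())
    return ratings

def add_mcar(ratings, proportion, rng):
    """Blank out a fixed share of cells completely at random."""
    out = ratings.copy()
    n_missing = int(round(proportion * out.size))
    idx = rng.choice(out.size, size=n_missing, replace=False)
    out.flat[idx] = np.nan
    return out

# One entry per scenario: (n, prevalence label, random-rating label, M)
grid = list(itertools.product(SAMPLE_SIZES, PREVALENCE, RANDOM_RATING, MISSING_PROPORTIONS))

rng = np.random.default_rng(2026)
complete = simulate_ratings(100, PREVALENCE["skewed"], RANDOM_RATING["mixed"], rng)
incomplete = add_mcar(complete, 0.20, rng)
```

Under these assumptions the grid has 4 × 2 × 3 × 10 = 240 scenarios per missingness mechanism; MAR and MNAR would need a deletion rule that depends on observed or unobserved ratings, which this sketch does not attempt.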
Fig.1 Bias and mean squared error (MSE) under balanced disease prevalence for 2 raters and 2 categories. A: Bias; B: MSE. Partial Deletion: excluding subjects with zero ratings; Listwise Deletion: excluding subjects with incomplete ratings; Rater Mode: imputation by rater mode; Subject Mode: imputation by subject mode; Random ratings: random rating probabilities; Low: $r_1 = 0.05$, $r_2 = 0.05$; High: $r_1 = 0.20$, $r_2 = 0.20$; Mixed: $r_1 = 0.05$, $r_2 = 0.20$. Different colors represent different methods, and different line types represent different random rating probabilities. For example, the green solid line shows the bias or MSE of Rater Mode with low random rating probabilities; the green dashed line, with high random rating probabilities; and the green dotted line, with mixed random rating probabilities.
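The two criteria plotted in the figures can be computed from Monte Carlo replicates as in the short sketch below. It assumes that `true_value` is the reference AC1 for a scenario (for example, the value obtained without missing data) and that `estimates` holds one AC1 estimate per replicate; both names and the sample numbers are illustrative, not results from the paper.

```python
import numpy as np

def bias_and_mse(estimates, true_value):
    """Bias = mean error; MSE = mean squared error across replicates."""
    errors = np.asarray(estimates, dtype=float) - true_value
    return errors.mean(), (errors ** 2).mean()

# Made-up replicate estimates for a single scenario
estimates = [0.78, 0.82, 0.80, 0.76]
bias, mse = bias_and_mse(estimates, true_value=0.80)
variance = np.var(estimates)          # population variance of the estimates
```

Because MSE equals squared bias plus the variance of the estimates, a method can show near-zero bias and still have a noticeable MSE when its estimates are unstable.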
Fig.3 Bias and mean squared error (MSE) under balanced disease prevalence for 8 raters and 4 categories. A: Bias; B: MSE. Low: $r_q = 0.05$, q = 1, …, 8; High: $r_q = 0.20$, q = 1, …, 8; Mixed: $r_q = 0.05$, $r_{q+1} = 0.20$, q = 1, 3, 5, 7. Different colors represent different methods, and different line types represent different random rating probabilities.
Fig.4 Bias and mean squared error (MSE) under skewed disease prevalence for 8 raters and 4 categories. A: Bias; B: MSE. Low: $r_q = 0.05$, q = 1, …, 8; High: $r_q = 0.20$, q = 1, …, 8; Mixed: $r_q = 0.05$, $r_{q+1} = 0.20$, q = 1, 3, 5, 7. Different colors represent different methods, and different line types represent different random rating probabilities.