Comparison of missing data handling methods for AC₁ coefficient estimation

doi:10.12122/j.issn.1673-4254.2026.02.24

Abstract

Objective To assess, through simulation studies, how different missing data handling methods affect estimation of the AC₁ coefficient.

Methods Monte Carlo simulation was used to generate rating data under different missing data mechanisms. The simulation parameters were the number of raters, number of categories, sample size, disease prevalence, random rating probability, and missing proportion. Four missing data handling methods were compared using bias and mean squared error (MSE) as metrics: excluding subjects with zero ratings, excluding subjects with incomplete ratings, rater mode imputation, and subject mode imputation.

Results When disease prevalence was balanced, or when the data were missing completely at random (MCAR) or missing at random (MAR), excluding subjects with zero ratings performed best, with bias and MSE close to zero at missing proportions below 30%. Under skewed prevalence with data missing not at random (MNAR), subject mode imputation was superior for AC₁ estimation, with bias within ±0.10 and MSE below 0.09; given a sufficient sample size and a missing proportion ≤30%, the MSE of this method was nearly zero. Rater mode imputation performed worst across all of these scenarios. Excluding subjects with incomplete ratings gave an acceptable error only in relatively simple settings (two raters and two categories) with a low missing proportion under MCAR/MAR, and was unstable in other scenarios.

Conclusion No universally optimal method exists for handling missing data in AC₁ estimation. We recommend excluding subjects with zero ratings for balanced prevalence or MCAR/MAR, and subject mode imputation for skewed prevalence under MNAR. Researchers should report AC₁ estimates from multiple methods so that the sensitivity of the results can be assessed.

Key words: agreement evaluation, nominal ratings, AC₁ coefficient, missing data

Keke LI, Lishan XU, Milai YU, Shengli AN. Comparison of missing data handling methods for AC₁ coefficient estimation[J]. Journal of Southern Medical University, 2026, 46(2): 466-472.

Figures/Tables 8

Tab.1 Contingency table for random and certain ratings

Rater A	Category	Rater B: random ratings			Rater B: certain ratings		
		1	...	k	1	...	k
Random ratings	1	n_{11}^{RR}	...	n_{1k}^{RR}	n_{11}^{RC}	...	n_{1k}^{RC}
	...	...	n_{jj'}^{RR}	...	...	n_{jj'}^{RC}	...
	k	n_{k1}^{RR}	...	n_{kk}^{RR}	n_{k1}^{RC}	...	n_{kk}^{RC}
Certain ratings	1	n_{11}^{CR}	...	n_{1k}^{CR}	n_{11}^{CC}	...	0
	...	...	n_{jj'}^{CR}	...	...	n_{jj'}^{CC}	...
	k	n_{k1}^{CR}	...	n_{kk}^{CR}	0	...	n_{kk}^{CC}
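The split into random and certain ratings in Tab.1 can be illustrated with a small simulation. This is only a sketch under an assumed mechanism that is not taken from the article: each rater gives a random rating with some probability, drawn uniformly from the k categories, and otherwise gives a certain rating equal to the subject's true category. The function name `simulate_ratings` and all of its parameters are hypothetical.

```python
import numpy as np

def simulate_ratings(n, prevalence, random_probs, seed=0):
    """Simulate an n x r matrix of nominal ratings coded 0..k-1.

    Assumed mechanism (hypothetical, not taken from the article): rater q
    rates at random with probability random_probs[q], drawing uniformly
    from the k categories; otherwise the rating is "certain" and equals
    the subject's true category.
    """
    rng = np.random.default_rng(seed)
    prevalence = np.asarray(prevalence, dtype=float)
    random_probs = np.asarray(random_probs, dtype=float)
    k, r = prevalence.size, random_probs.size

    truth = rng.choice(k, size=n, p=prevalence)        # true category per subject
    is_random = rng.random((n, r)) < random_probs      # True where the rating is random
    guesses = rng.integers(0, k, size=(n, r))          # uniform random ratings
    ratings = np.where(is_random, guesses, truth[:, None])
    return ratings, is_random

ratings, is_random = simulate_ratings(
    n=200, prevalence=[0.90, 0.10], random_probs=[0.05, 0.20]
)

# Subjects rated with certainty by both raters receive the same (true)
# category, so the off-diagonal cells of the certain/certain block are 0.
both_certain = ~is_random[:, 0] & ~is_random[:, 1]
print((ratings[both_certain, 0] == ratings[both_certain, 1]).all())  # True
```

Under this assumption, disagreement can only arise when at least one of the two ratings is random, which matches the zero off-diagonal cells in the certain/certain block of the table.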

Tab.2 AC₁ coefficient formulas

Scenario	P_a	P_e
Two raters (r=2)	$P_a = \frac{1}{n}\sum_{j=1}^{k} n_{jj}$	$P_e = \frac{1}{k-1}\sum_{j=1}^{k} \hat{\pi}_j (1-\hat{\pi}_j)$, where $\hat{\pi}_j = \frac{n_{j+} + n_{+j}}{2n}$
Multiple raters (r≥3)	$P_a = \frac{1}{nr(r-1)}\left(\sum_{i=1}^{n}\sum_{j=1}^{k} r_{ij}^{2} - nr\right)$	$P_e = \frac{1}{k-1}\sum_{j=1}^{k} \hat{\pi}_j (1-\hat{\pi}_j)$, where $\hat{\pi}_j = \frac{1}{nr}\sum_{i=1}^{n} r_{ij}$
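A minimal Python sketch of the Tab.2 formulas follows, for complete data only. It combines P_a and P_e through the usual chance-corrected form AC₁ = (P_a − P_e)/(1 − P_e). The function names and the example table are illustrative assumptions, not part of the article.

```python
import numpy as np

def ac1_two_raters(table):
    """AC1 from a k x k contingency table (rows: rater A, columns: rater B)."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    k = table.shape[0]
    p_a = np.trace(table) / n                                # observed agreement
    pi = (table.sum(axis=1) + table.sum(axis=0)) / (2 * n)   # (n_j+ + n_+j) / 2n
    p_e = (pi * (1 - pi)).sum() / (k - 1)                    # chance agreement
    return (p_a - p_e) / (1 - p_e)

def ac1_multi_raters(counts):
    """AC1 from an n x k matrix with counts[i, j] = r_ij, the number of
    raters who put subject i in category j. Assumes complete data, so
    every row sums to the same number of raters r."""
    counts = np.asarray(counts, dtype=float)
    n, k = counts.shape
    r = counts[0].sum()
    p_a = ((counts ** 2).sum() - n * r) / (n * r * (r - 1))
    pi = counts.sum(axis=0) / (n * r)
    p_e = (pi * (1 - pi)).sum() / (k - 1)
    return (p_a - p_e) / (1 - p_e)

# Hypothetical 2 x 2 table: 40 + 50 agreements, 10 disagreements.
table = [[40, 5], [5, 50]]
# The same data as per-subject category counts.
counts = [[2, 0]] * 40 + [[0, 2]] * 50 + [[1, 1]] * 10

print(ac1_two_raters(table))
print(ac1_multi_raters(counts))
```

For this table P_a = 0.90 and P_e = 0.495, and both functions return the same value because the multiple-rater formula reduces to the two-rater one when r = 2.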

Tab.3 Contingency table for response category with missing ratings

Rater A	Rater B				Total
	1	...	k	Missing	
1	n_{11}	...	n_{1k}	n_{1m}	n_{1+}
...	...	n_{jj'}	...	...	...
k	n_{k1}	...	n_{kk}	n_{km}	n_{k+}
Missing	n_{m1}	...	n_{mk}	n_{mm}	n_{m+}
Total	n_{+1}	...	n_{+k}	n_{+m}	n
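The four handling methods named in the abstract can be sketched as operations on a subject-by-rater matrix in which `NaN` plays the role of the "Missing" row and column of Tab.3. This is a rough illustration rather than the authors' implementation: the function names are hypothetical, ties in the mode are broken toward the smallest category code, and subjects with no ratings are left unimputed by subject mode imputation, all of which are assumptions.

```python
import numpy as np

def partial_deletion(x):
    """Exclude subjects (rows) with zero ratings."""
    return x[~np.isnan(x).all(axis=1)]

def listwise_deletion(x):
    """Exclude subjects with incomplete ratings (any missing value)."""
    return x[~np.isnan(x).any(axis=1)]

def _mode(values):
    """Most frequent observed value; ties go to the smallest category
    code (an assumption). Returns NaN if nothing was observed."""
    observed = values[~np.isnan(values)]
    if observed.size == 0:
        return np.nan
    cats, freq = np.unique(observed, return_counts=True)
    return cats[np.argmax(freq)]

def rater_mode_imputation(x):
    """Fill each rater's missing ratings with that rater's modal category."""
    out = x.copy()
    for q in range(out.shape[1]):
        fill = _mode(out[:, q])
        out[np.isnan(out[:, q]), q] = fill
    return out

def subject_mode_imputation(x):
    """Fill each subject's missing ratings with the modal category among
    that subject's observed ratings."""
    out = x.copy()
    for i in range(out.shape[0]):
        fill = _mode(out[i])
        out[i, np.isnan(out[i])] = fill
    return out

nan = np.nan
x = np.array([[0, 0], [1, nan], [nan, nan], [0, 1], [nan, 1]], dtype=float)

print(partial_deletion(x).shape)      # (4, 2): only the all-missing subject is dropped
print(listwise_deletion(x).shape)     # (2, 2): only fully rated subjects remain
print(rater_mode_imputation(x))
print(subject_mode_imputation(x))
```

In this toy matrix, rater mode imputation fills the first rater's gaps with 0 and the second rater's gaps with 1, while subject mode imputation copies the subject's own observed rating and cannot fill the subject with no ratings at all.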

Tab.4 Setting of the simulation parameters

Parameter	2 raters, 2 categories	8 raters, 4 categories
Sample size, n	25, 50, 100, 200	25, 50, 100, 200
Disease prevalence, P_r	Skewed: (0.90, 0.10); Balanced: (0.50, 0.50)	Skewed: (0.70, 0.15, 0.10, 0.05); Balanced: (0.25, 0.25, 0.25, 0.25)
Random rating probability, r_q	Low: r_1 = 0.05, r_2 = 0.05; High: r_1 = 0.20, r_2 = 0.20; Mixed: r_1 = 0.05, r_2 = 0.20	Low: r_q = 0.05, q = 1, …, 8; High: r_q = 0.20, q = 1, …, 8; Mixed: r_q = 0.05, r_{q+1} = 0.20, q = 1, 3, 5, 7
Missing proportion, M	5%, 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%	5%, 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%
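The parameter settings in Tab.4 can be enumerated as a scenario grid, for example as below. The dictionary layout and the name `scenarios` are illustrative assumptions. The grid covers only the parameters listed in the table; crossing it with the missing data mechanisms (MCAR, MAR, MNAR) is left out because the table does not specify how they were combined.

```python
from itertools import product

SAMPLE_SIZES = [25, 50, 100, 200]
MISSING_PROPORTIONS = [m / 100 for m in range(5, 55, 5)]  # 0.05, 0.10, ..., 0.50

DESIGNS = {
    "2 raters, 2 categories": {
        "prevalence": {"skewed": (0.90, 0.10), "balanced": (0.50, 0.50)},
        "random_probs": {
            "low": (0.05, 0.05),
            "high": (0.20, 0.20),
            "mixed": (0.05, 0.20),
        },
    },
    "8 raters, 4 categories": {
        "prevalence": {
            "skewed": (0.70, 0.15, 0.10, 0.05),
            "balanced": (0.25, 0.25, 0.25, 0.25),
        },
        "random_probs": {
            "low": (0.05,) * 8,
            "high": (0.20,) * 8,
            "mixed": (0.05, 0.20) * 4,  # r_q = 0.05, r_{q+1} = 0.20 for q = 1, 3, 5, 7
        },
    },
}

def scenarios(design):
    """Yield one dict per combination of the Tab.4 settings for a design."""
    spec = DESIGNS[design]
    for n, (prev_name, prev), (rq_name, rq), m in product(
        SAMPLE_SIZES,
        spec["prevalence"].items(),
        spec["random_probs"].items(),
        MISSING_PROPORTIONS,
    ):
        yield {
            "n": n,
            "prevalence": prev,
            "random_probs": rq,
            "missing": m,
            "labels": (prev_name, rq_name),
        }

for design in DESIGNS:
    print(design, len(list(scenarios(design))))  # 4 * 2 * 3 * 10 = 240 each
```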

Fig.1 Bias and mean squared error (MSE) under balanced disease prevalence for 2 raters and 2 categories. A: Bias; B: MSE. Partial Deletion: excluding subjects with zero ratings; Listwise Deletion: excluding subjects with incomplete ratings; Rater Mode: imputation by rater mode; Subject Mode: imputation by subject mode; Random ratings: random rating probabilities; Low: r_1 = 0.05, r_2 = 0.05; High: r_1 = 0.20, r_2 = 0.20; Mixed: r_1 = 0.05, r_2 = 0.20. Different colors represent different methods, and different line types represent different random rating probabilities. For example, the green solid line shows the bias or MSE of Rater Mode with low random rating probabilities; the green dashed line, with high random rating probabilities; and the green dotted line, with mixed random rating probabilities.

Fig.2 Bias and mean squared error (MSE) under skewed disease prevalence for 2 raters and 2 categories. A: Bias; B: MSE.

Fig.3 Bias and mean squared error (MSE) under balanced disease prevalence for 8 raters and 4 categories. A: Bias; B: MSE. Low: r_q = 0.05, q = 1, …, 8; High: r_q = 0.20, q = 1, …, 8; Mixed: r_q = 0.05, r_{q+1} = 0.20, q = 1, 3, 5, 7. Different colors represent different methods, and different line types represent different random rating probabilities.

Fig.4 Bias and mean squared error (MSE) under skewed disease prevalence for 8 raters and 4 categories. A: Bias; B: MSE. Low: r_q = 0.05, q = 1, …, 8; High: r_q = 0.20, q = 1, …, 8; Mixed: r_q = 0.05, r_{q+1} = 0.20, q = 1, 3, 5, 7. Different colors represent different methods, and different line types represent different random rating probabilities.
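The two metrics plotted in the figures can be computed from replicated estimates as in the sketch below. The choice of reference value is an assumption here, since this page does not state which "true" AC₁ the authors used, and the numbers in the example are made up for illustration.

```python
import numpy as np

def bias_and_mse(estimates, true_value):
    """Bias and mean squared error of replicated estimates against a
    reference value (the choice of reference is an assumption)."""
    errors = np.asarray(estimates, dtype=float) - true_value
    return errors.mean(), (errors ** 2).mean()

# Hypothetical AC1 estimates from four simulation replicates.
estimates = [0.78, 0.82, 0.80, 0.76]
bias, mse = bias_and_mse(estimates, true_value=0.80)
print(bias, mse)

# MSE decomposes into squared bias plus the (population) variance.
print(np.isclose(mse, bias ** 2 + np.var(estimates)))  # True
```

The decomposition in the last line is why a method can show near-zero bias yet a noticeable MSE: the remaining error comes from the variability of the estimates across replicates.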
