Journal of Southern Medical University ›› 2026, Vol. 46 ›› Issue (2): 353-361. doi: 10.12122/j.issn.1673-4254.2026.02.13

Dual role of tea consumption in gastrointestinal disease risks: analysis using a risk prediction model integrating interpretable machine learning and large language model

Junyao CHEN1, Zeyu CHEN2, Zhaojie LIN1, Menghao FANG1, Chaoying SHEN3, Qi XU4, Xiaoyi ZHANG5, Lu LU1

  1. School of Disaster and Emergency Medicine, Tianjin University, Tianjin 300072, China
  2. Department of Gastroenterology
  3. Center of Gastrointestinal Endoscopy, Anxi Hospital of Traditional Chinese Medicine, Quanzhou 362400, China
  4. Institute of Open Education, Jinzhou Open University, Jinzhou 121000, China
  5. School of Architecture and Art Design, Henan Polytechnic University, Jiaozuo 454000, China
  • Received: 2025-06-26 Online: 2026-02-20 Published: 2026-03-10
  • Contact: Lu LU E-mail: cjy2300@tju.edu.cn; 380893842@qq.com; Lulu_998543@tju.edu.cn

Abstract:

Objective To explore the correlation between tea consumption and the risk of gastrointestinal diseases using a risk prediction model that integrates interpretable machine learning and a large language model.

Methods A survey was conducted among patients undergoing both gastroscopy and 13C-urea breath testing at the Gastrointestinal Endoscopy Center of Anxi Hospital of Traditional Chinese Medicine. Univariate analysis was performed to assess the suitability of the selected features. The collected data were randomly divided into training and testing sets in a 7:3 ratio. Six classifiers, namely Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Logistic Regression (LR), Random Forest (RF), Extreme Gradient Boosting (XGB), and Deep Neural Network (DNN), were compared to identify the best one for predicting high-risk gastrointestinal conditions. A Bayesian optimization algorithm was used to obtain the optimal hyperparameter combination for each of the 6 models. After model fitting, the interpretability of the best models was analyzed using SHapley Additive exPlanations (SHAP). The DeepSeek-R1 base language model was fine-tuned with the gastrointestinal disease dataset and Chinese medical online consultation data to obtain the final model.

Results The study included 503 participants. All the selected features were associated with gastrointestinal diseases, but only age exhibited a significant linear correlation (β=0.023, SE=0.008, t=2.942, P=0.003). The DNN model performed best, with an accuracy of 0.68, precision of 0.68, recall of 0.85, F1 score of 0.75, and AUC of 0.74. The top 3 features by importance were age, DOB (delta over baseline) value, and smoking history. The resulting large language model provided recommendations consistent with those of professional physicians based on gastroscopy results.

Conclusion The DNN model is effective for predicting gastrointestinal disease risk and offers reliable support for clinical risk assessment and decision-making regarding endoscopy. Smoking cessation, moderate alcohol consumption, and reasonable tea intake may help prevent gastrointestinal diseases.
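
For readers who want to see how the evaluation quantities above fit together, the following is a minimal sketch, not the authors' code. It uses only the Python standard library to show a seeded random 7:3 split and the computation of accuracy, precision, recall, and F1 score from binary predictions. The function names `split_indices` and `binary_metrics`, the seed, the rounding rule for the test-set size, and the toy labels are all assumptions made for illustration; the abstract does not state the actual split sizes or the software used.

```python
import random


def split_indices(n_samples, test_fraction=0.3, seed=0):
    """Return (train, test) index lists from a seeded random shuffle."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Assumption: the test-set size is rounded to the nearest integer.
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for 0/1 labels (1 = high risk)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# 503 is the participant count reported in the abstract.
train_idx, test_idx = split_indices(503)

# Made-up labels for illustration only; these are not data from the study.
toy_true = [1, 1, 1, 0, 0]
toy_pred = [1, 1, 0, 1, 0]
metrics = binary_metrics(toy_true, toy_pred)

print(len(train_idx), len(test_idx))
print(metrics)
```

Under the rounding rule assumed here, 503 records split into 352 training and 151 testing records. In practice a library routine with stratification on the outcome label would likely be preferred so that both sets keep a similar proportion of high-risk cases.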

Key words: gastrointestinal diseases, Helicobacter pylori, machine learning, risk prediction, large language model