Journal of Southern Medical University ›› 2021, Vol. 41 ›› Issue (8): 1234-1242. doi: 10.12122/j.issn.1673-4254.2021.08.16

Identification of potential regulatory genes for embryonic stem cell self-renewal and pluripotency by random forest

ZENG Pengguihang, TANG Xiuxiao, WU Tingqin, TIAN Qi, LI Mangmang, DING Junjun   

  1. School of Basic Medical Science, Southern Medical University, Guangzhou 510515, China; Zhongshan School of Medicine, Sun Yat-Sen University, Guangzhou 510080, China
  • Online: 2021-08-20 Published: 2021-09-07

Abstract: Objective To identify novel genes associated with self-renewal and pluripotency of mouse embryonic stem cells (mESCs) by integrating multiomics data with machine learning methods. Methods We collected multiomics data from mESCs, covering the transcriptome, histone modifications, chromatin accessibility, and transcription factor and architectural protein binding, and compared the signals of known stem cell self-renewal and pluripotency genes with those of other genes. By integrating these multiomics data, we built prediction models with several machine learning classifiers, including random forest, and performed 5-fold cross-validation. Two thirds of the input samples served as the training dataset, and the remaining one third served as the test dataset for assessing model performance in independent tests. Finally, the model's predictions were validated through gene function annotation and cell function experiments, including cell viability assay, colony formation assay and cell cycle analysis. Results The genes known to be associated with self-renewal and pluripotency of mESCs showed multiomics features significantly different from those of random genes. Random forest outperformed the other machine learning algorithms tested on these multiomics data, with an area under the curve (AUC) of 0.883±0.018 in cross-validation and 0.880±0.028 in the independent test. With this model, we identified 893 potential regulatory genes associated with self-renewal and pluripotency of mESCs, which resembled the known genes in functional annotation. Knockdown of the newly predicted regulator gene Cct6a significantly decreased the viability of mESCs (P<0.0001) and the number of cell clones formed (P<0.01), significantly increased the number of cells in G1 phase (P<0.01), and significantly decreased the number of cells in S phase (P<0.05). In addition, mESCs with Cct6a knockdown failed to stain positive for alkaline phosphatase. Conclusion A machine learning model based on multiomics data can predict potential self-renewal and pluripotency regulators with good performance. Using this model, we predicted potential self-renewal and pluripotency regulatory genes, including Cct6a, and validated the prediction experimentally. This model provides new insights into the regulatory mechanism of mESCs and contributes to stem cell research.
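
The evaluation protocol described in the Methods (a two-thirds/one-third split, 5-fold cross-validation on the training portion, and AUC as the metric) can be illustrated with a minimal standard-library sketch. It is not the authors' code: the data are synthetic, all function names are hypothetical, and a simple nearest-centroid scorer stands in for the random forest, which in practice would more likely come from a library such as scikit-learn. Reporting the cross-validation AUC as mean and standard deviation is one plausible reading of the ± notation in the abstract, not something the abstract states.

```python
import random
import statistics


def roc_auc(labels, scores):
    """AUC = P(random positive scores higher than random negative); ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def split_train_test(n, rng):
    """Shuffle indices 0..n-1 and hold out one third as the independent test set."""
    idx = list(range(n))
    rng.shuffle(idx)
    n_test = n // 3
    return idx[n_test:], idx[:n_test]


def k_folds(indices, k=5):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i, val in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, val


def fit_centroids(X, y, idx):
    """Stand-in classifier (NOT a random forest): per-class feature means."""
    def mean_vec(rows):
        return [statistics.fmean(col) for col in zip(*rows)]
    pos = mean_vec([X[i] for i in idx if y[i] == 1])
    neg = mean_vec([X[i] for i in idx if y[i] == 0])
    return pos, neg


def score(model, x):
    """Higher score = closer to the positive centroid than to the negative one."""
    pos, neg = model
    d_pos = sum((a - b) ** 2 for a, b in zip(x, pos))
    d_neg = sum((a - b) ** 2 for a, b in zip(x, neg))
    return d_neg - d_pos


def evaluate_protocol(n=300, n_features=3, seed=0):
    """Run the split / 5-fold CV / independent-test protocol on synthetic data."""
    rng = random.Random(seed)
    # Synthetic "genes": label 1 = known regulator, features shifted by +1.
    y = [i % 2 for i in range(n)]
    X = [[rng.gauss(float(label), 1.0) for _ in range(n_features)] for label in y]

    train_idx, test_idx = split_train_test(n, rng)

    # 5-fold cross-validation inside the training set.
    cv_aucs = []
    for tr, val in k_folds(train_idx, k=5):
        model = fit_centroids(X, y, tr)
        cv_aucs.append(
            roc_auc([y[i] for i in val], [score(model, X[i]) for i in val])
        )

    # Refit on the full training set, then score the held-out third once.
    final = fit_centroids(X, y, train_idx)
    test_auc = roc_auc(
        [y[i] for i in test_idx], [score(final, X[i]) for i in test_idx]
    )
    return cv_aucs, test_auc


cv_aucs, test_auc = evaluate_protocol()
print(
    f"CV AUC: {statistics.mean(cv_aucs):.3f} ± {statistics.stdev(cv_aucs):.3f}; "
    f"independent test AUC: {test_auc:.3f}"
)
```

Keeping the held-out third untouched until the final scoring step is what makes the second AUC an independent estimate; the cross-validation folds only ever see the training portion.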

Key words: mouse embryonic stem cells; self-renewal; pluripotency; machine learning; random forest