Chinese Journal of Schistosomiasis Control, 2024, Vol. 36, Issue 6: 562-571, 613.

• Original Article •

Prediction of potential geographic distribution of Oncomelania hupensis in Yunnan Province using random forest and maximum entropy models

ZHANG Zongya1, DU Chunhong1, ZHANG Yun1, WANG Hongqiong1, SONG Jing1, ZHOU Jihua1, WANG Lifang1, SUN Jiayu1, SHEN Meifen1, CHEN Chunqiong1, JIANG Hua1, YAN Jiaqi1, FENG Xiguang1, WANG Wenya2, QIAN Peijun2, XUE Jingbo2, LI Shizhu2*, DONG Yi1*

  1 Yunnan Institute of Endemic Diseases Control and Prevention, Yunnan Provincial Key Laboratory for Zoonosis Control and Prevention, Dali, Yunnan 671000, China; 2 National Institute of Parasitic Diseases, Chinese Center for Disease Control and Prevention (Chinese Center for Tropical Diseases Research), National Health Commission Key Laboratory of Parasite and Vector Biology, Shanghai 200025, China
  • Online: 2024-12-25 Published: 2024-12-31
  • Corresponding authors: LI Shizhu, lisz@chinacdc.cn; DONG Yi, dydali@sina.com
  • About the author: ZHANG Zongya, male, bachelor's degree, physician. Research interest: schistosomiasis control
  • Funding:
    National Key Research and Development Program of China (2021YFC2300800, 2021YFC2300804); National Natural Science Foundation of China (32161143036, 32311540013); Young Scientific and Technological Talent Training Program of Yunnan Institute of Endemic Diseases Control and Prevention (YIEDC-T202104)

Abstract: Objective To predict the potential geographic distribution of Oncomelania hupensis in Yunnan Province using random forest (RF) and maximum entropy (MaxEnt) models, so as to provide a reference for O. hupensis surveillance and control in Yunnan Province.

Methods O. hupensis snail survey data collected in Yunnan Province from 2015 to 2016 were converted into snail presence-point data. Data on 22 environmental variables in Yunnan Province were collected, including 12 climate variables (annual potential evapotranspiration, annual mean ground surface temperature, annual precipitation, annual mean air pressure, annual mean relative humidity, annual sunshine duration, annual mean air temperature, annual mean wind speed, ≥ 0 ℃ annual accumulated temperature, ≥ 10 ℃ annual accumulated temperature, aridity and moisture index), 8 geographical variables (normalized difference vegetation index, landform type, land use type, altitude, soil type, soil texture-clay content, soil texture-sand content and soil texture-silt content) and 2 population and economic variables (gross domestic product and population distribution). After variable screening with Pearson correlation and variance inflation factor (VIF) tests, RF and MaxEnt models and their ensemble model were built using the biomod2 package in R 4.2.1 to predict the potential distribution of O. hupensis snails in Yunnan Province after 2016. Model performance was evaluated by cross-validation and independent-data validation, using the area under the receiver operating characteristic curve (AUC), the true skill statistic (TSS) and the Kappa statistic. The environmental variable contributions output by models with AUC > 0.95 and TSS > 0.85 were normalized to obtain importance percentages, which were used to analyze the importance of environmental variables.

Results A total of 148 O. hupensis presence points and 15 environmental variables were included in training the RF and MaxEnt models. Both models performed well, with mean AUC values > 0.9 and mean TSS and Kappa values > 0.8, and the differences between the two models in AUC (t = 19.862, P < 0.05), TSS (t = 10.140, P < 0.05) and Kappa values (t = 10.237, P < 0.05) were statistically significant. The AUC, TSS and Kappa values of the ensemble model were 0.996, 0.954 and 0.920, respectively. In independent-data validation, the AUC, TSS and Kappa values of the RF and ensemble modeling results were all 1, indicating good performance on unseen data, whereas the MaxEnt model performed less well, with TSS and Kappa values of 0 in 24% (24/100) of modeling results. A total of 79 RF modeling results, 38 MaxEnt modeling results and the ensemble modeling results had AUC > 0.95 and TSS > 0.85 and were included in the evaluation of environmental variable importance. In the RF, MaxEnt and ensemble models, the importance of annual sunshine duration (SSD) was 32.989%, 37.847% and 46.315%, and that of annual mean relative humidity (RHU) was 30.947%, 15.921% and 28.121%, respectively; both were relatively important variables. Important environmental variables were relatively concentrated in the RF modeling results, more dispersed in the MaxEnt modeling results, and most concentrated in the ensemble modeling results. The potential distribution of O. hupensis snails in Yunnan Province after 2016 was relatively concentrated as predicted by the RF model and relatively large as predicted by the MaxEnt model, while the ensemble prediction highlighted the areas of concentrated snail distribution common to the RF and MaxEnt predictions.

Conclusions Both RF and MaxEnt models can effectively predict the potential distribution of O. hupensis snails in Yunnan Province, which facilitates targeted snail control.
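The three evaluation statistics named in the abstract (AUC, TSS and Kappa) can be illustrated with a minimal Python sketch. This is not the authors' biomod2 workflow; the function names, the toy presence/absence labels, the suitability scores and the 0.5 threshold are all assumptions made for illustration only.

```python
from typing import Sequence


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]):
    """Return (tp, fp, fn, tn) for presence (1) / absence (0) labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def tss(tp: int, fp: int, fn: int, tn: int) -> float:
    """True skill statistic: sensitivity + specificity - 1."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1.0


def cohen_kappa(tp: int, fp: int, fn: int, tn: int) -> float:
    """Cohen's Kappa: agreement beyond what chance alone would give."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    return (observed - expected) / (1.0 - expected)


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random presence point scores higher
    than a random absence point (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 4 presence points and 4 absence points.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2, 0.1, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

counts = confusion_counts(y_true, y_pred)
toy_tss = tss(*counts)
toy_kappa = cohen_kappa(*counts)
toy_auc = auc(y_true, scores)

# A classifier that predicts absence everywhere scores 0 on TSS and Kappa.
all_absent = confusion_counts(y_true, [0] * len(y_true))
degenerate_tss = tss(*all_absent)
degenerate_kappa = cohen_kappa(*all_absent)
```

TSS and Kappa both depend on the threshold used to convert suitability scores into presence/absence, whereas AUC is threshold-free, which is one reason the three statistics are commonly reported together.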

Key words: Oncomelania hupensis, Random forest model, Maximum entropy model, Geographical distribution, Predictive performance, Yunnan Province 

CLC number:
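The importance-percentage step described in the abstract (keeping only modeling results with AUC > 0.95 and TSS > 0.85, then normalizing the environmental variable contributions) might be sketched as follows. The abstract does not state how contributions from multiple runs were combined, so averaging across the retained runs before normalizing is an assumption here. The function name, the run records, the variable code `PRE` and all numeric values are hypothetical.

```python
from typing import Dict, List


def importance_percentages(runs: List[dict],
                           auc_min: float = 0.95,
                           tss_min: float = 0.85) -> Dict[str, float]:
    """Average raw variable contributions over runs that pass both
    thresholds, then rescale so the percentages sum to 100."""
    kept = [r for r in runs if r["auc"] > auc_min and r["tss"] > tss_min]
    if not kept:
        raise ValueError("no run passes the AUC/TSS thresholds")
    variables = list(kept[0]["importance"])
    # Assumption: a simple mean across the retained runs.
    mean_imp = {v: sum(r["importance"][v] for r in kept) / len(kept)
                for v in variables}
    total = sum(mean_imp.values())
    return {v: 100.0 * x / total for v, x in mean_imp.items()}


# Hypothetical run records; the third fails both thresholds and is dropped.
toy_runs = [
    {"auc": 0.99, "tss": 0.95,
     "importance": {"SSD": 0.4, "RHU": 0.3, "PRE": 0.1}},
    {"auc": 0.97, "tss": 0.90,
     "importance": {"SSD": 0.6, "RHU": 0.1, "PRE": 0.1}},
    {"auc": 0.93, "tss": 0.80,
     "importance": {"SSD": 0.0, "RHU": 0.0, "PRE": 1.0}},
]

toy_percentages = importance_percentages(toy_runs)
```

Normalizing to percentages makes contributions comparable across model types whose raw importance scores are on different scales, which is what allows the RF, MaxEnt and ensemble figures to be set side by side.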