周亚男, 陈绘, 刘洪斌. 基于多源数据和Stacking-SHAP方法的山地丘陵区土地覆被分类[J]. 农业工程学报, 2022, 38(23): 213-222. DOI: 10.11975/j.issn.1002-6819.2022.23.023
    Zhou Yanan, Chen Hui, Liu Hongbin. Land cover classification in hilly and mountainous areas using multi-source data and Stacking-SHAP technique[J]. Transactions of the Chinese Society of Agricultural Engineering (Transactions of the CSAE), 2022, 38(23): 213-222. DOI: 10.11975/j.issn.1002-6819.2022.23.023

    基于多源数据和Stacking-SHAP方法的山地丘陵区土地覆被分类

    Land cover classification in hilly and mountainous areas using multi-source data and Stacking-SHAP technique

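    • Summary: Hilly and mountainous areas have complex terrain and severely distorted surface radiation signals, which makes ground objects difficult to identify. To extract land cover information accurately in such areas, this study combines multi-source heterogeneous data, Stacking ensemble learning, and the SHapley Additive exPlanations (SHAP) method for land cover classification. Four types of feature variables (remote sensing, climate, soil, and terrain) were extracted from Sentinel-1/2 imagery, climate data, soil data, and a digital elevation model (DEM). Several variable combination schemes were designed and paired with the Stacking algorithm to examine how useful each type of variable is for identifying ground objects in mountainous areas. The best Stacking scheme was then compared with Support Vector Machine (SVM), Random Forest (RF), and eXtreme Gradient Boosting (XGBoost) in terms of classification accuracy, to evaluate how well Stacking extracts land cover information in mountainous areas. In addition, SHAP was introduced to quantify the importance of each feature variable in the Stacking model. The results show that classification accuracy was lowest for the baseline scheme that used remote sensing variables only. Adding climate, soil, or terrain variables each raised the overall accuracy, Kappa coefficient, and F1 score, with the largest gains for dry farmland, paddy field, and orchard. The Stacking scheme that combined all types of feature variables achieved the best accuracy, with an overall accuracy, Kappa coefficient, and F1 score of 96.61%, 0.96, and 94.81%, respectively, and it outperformed SVM, RF, and XGBoost trained on the same features. SHAP quantifies both the global and the local importance of feature variables in the Stacking model and clarifies the relative contribution of each variable to identifying each land cover type, which provides useful information for selecting and optimizing variables for land cover classification in mountainous areas. The study offers technical support and a theoretical reference for machine-learning-assisted land cover mapping in regions with complex landscapes.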
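
The overall accuracy, Kappa coefficient, and F1 score named above are all derived from a confusion matrix. The minimal Python sketch below shows one common way to compute them. The function name `classification_metrics`, the 3-class toy matrix, and the choice of macro-averaged F1 are assumptions made for illustration; the text does not state how the F1 score was averaged, and the numbers are not the study's data.

```python
from typing import Dict, List


def classification_metrics(cm: List[List[int]]) -> Dict[str, float]:
    """Overall accuracy, Cohen's kappa and macro F1 from a square confusion matrix.

    Rows are reference (true) classes, columns are predicted classes.
    """
    k = len(cm)
    total = sum(sum(row) for row in cm)
    row_sums = [sum(row) for row in cm]
    col_sums = [sum(cm[i][j] for i in range(k)) for j in range(k)]
    correct = sum(cm[i][i] for i in range(k))

    # Overall accuracy: share of samples on the diagonal.
    overall_accuracy = correct / total

    # Kappa: agreement beyond what the row/column marginals give by chance.
    expected = sum(row_sums[i] * col_sums[i] for i in range(k)) / total ** 2
    kappa = (overall_accuracy - expected) / (1 - expected)

    # Macro F1 (assumed averaging): unweighted mean of per-class F1.
    f1_scores = []
    for i in range(k):
        precision = cm[i][i] / col_sums[i] if col_sums[i] else 0.0
        recall = cm[i][i] / row_sums[i] if row_sums[i] else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
    macro_f1 = sum(f1_scores) / k

    return {"overall_accuracy": overall_accuracy, "kappa": kappa, "macro_f1": macro_f1}


# Hypothetical 3-class confusion matrix (not taken from the study).
toy_cm = [
    [45, 5, 0],
    [4, 40, 6],
    [1, 5, 44],
]
metrics = classification_metrics(toy_cm)
print(metrics)
```

For this toy matrix the sketch yields an overall accuracy of about 0.86, a Kappa of about 0.79, and a macro F1 of about 0.86. Kappa is lower than the overall accuracy because it discounts the agreement expected by chance.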

       

      Abstract: Accurate land cover classification provides a basic dataset for regional ecological protection and environmental management. Remote sensing (RS) images are currently the main data source for land cover extraction. However, hilly and mountainous areas have complex landscapes, fragmented ground objects, frequent cloud cover, and severe radiometric distortion, so it is difficult to map ground objects accurately from satellite images alone. Combining multi-source heterogeneous data can compensate for the limitations of a single data source and supply additional information that improves the separability of ground objects, which makes it promising for land cover extraction in areas with complex surface landscapes. In addition, the stacking algorithm, an ensemble machine learning method, has shown strong and robust predictive performance in recent classification tasks. This study therefore explores the effectiveness of multi-source heterogeneous data and the stacking algorithm for land cover classification in hilly and mountainous areas. The study area was Qianjiang District, Chongqing, China. Feature variables were extracted from multi-source heterogeneous data, including Sentinel-1/2 images, a Digital Elevation Model (DEM), and soil and climate data. The Boruta method and the Variance Inflation Factor (VIF) were applied to eliminate redundant features. Five schemes with different inputs were then built from the optimized variable subset: RS variables only, RS plus climate variables, RS plus terrain variables, RS plus soil variables, and all variables.
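
As a hedged illustration of the VIF screening step, the sketch below computes VIFs in plain Python using the identity that the VIF of feature j, 1 / (1 - R_j^2), equals the j-th diagonal element of the inverse of the feature correlation matrix. The abstract gives neither the threshold nor the implementation used in the study, so the helper names (`correlation_matrix`, `invert`, `vif`) and the feature names and values are all hypothetical.

```python
import math
from typing import List


def correlation_matrix(columns: List[List[float]]) -> List[List[float]]:
    """Pearson correlation matrix of the given feature columns."""
    n = len(columns[0])
    means = [sum(c) / n for c in columns]
    centered = [[v - m for v in c] for c, m in zip(columns, means)]
    norms = [math.sqrt(sum(v * v for v in c)) for c in centered]
    p = len(columns)
    return [
        [sum(a * b for a, b in zip(centered[i], centered[j])) / (norms[i] * norms[j])
         for j in range(p)]
        for i in range(p)
    ]


def invert(matrix: List[List[float]]) -> List[List[float]]:
    """Gauss-Jordan inversion with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("correlation matrix is singular (perfect collinearity)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(columns: List[List[float]]) -> List[float]:
    """VIF per feature: diagonal of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[j][j] for j in range(len(columns))]


# Hypothetical feature columns (values invented for illustration only).
ndvi = [0.2, 0.4, 0.5, 0.7, 0.8, 0.3]
evi = [0.15, 0.32, 0.41, 0.55, 0.66, 0.22]   # nearly a linear function of ndvi
elevation = [300.0, 820.0, 450.0, 1200.0, 600.0, 950.0]

vif_values = vif([ndvi, evi, elevation])
print(dict(zip(["ndvi", "evi", "elevation"], vif_values)))
```

In this toy data the two nearly collinear spectral indices receive very large VIFs, so one of them would be a candidate for removal under the conventional rule of thumb that flags VIFs above 5 or 10. Boruta, the other screening step, is a separate random-forest-based procedure that this sketch does not cover.
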
The stacking algorithm was used to build classification models for each scheme and to assess how the different types of variables affect land cover classification accuracy. The best stacking scheme was then compared with Support Vector Machine (SVM), Random Forest (RF), and eXtreme Gradient Boosting (XGBoost). In addition, SHapley Additive exPlanations (SHAP) was introduced to quantify the importance of the variables in the model. The results showed that the model using only remote sensing variables had the lowest classification accuracy, and that the overall accuracy, Kappa coefficient, and F1-score all improved after climate, soil, or terrain variables were added. The soil variables contributed the largest improvement, followed by the terrain and climate variables. The agricultural land types (dry farmland, paddy field, and orchard) gained the most in classification accuracy. The scheme with all feature variables achieved the best classification accuracy, with an overall accuracy of 96.61%, a Kappa coefficient of 0.96, and an F1-score of 94.81%, and the stacking model was more accurate than SVM, RF, and XGBoost given the same variables. SHAP quantified the global importance of each variable and indicated that the traditional vegetation and water spectral indices were the most important features. The local contribution of each variable to each land cover type also provides useful information for selecting and optimizing variables when extracting ground object information in hilly and mountainous areas. These findings offer technical support and a theoretical reference for land cover mapping in areas with complex landscapes.
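
SHAP attributes a model's output to its input features using Shapley values. The sketch below computes exact Shapley values for a single prediction by enumerating all feature subsets, with absent features replaced by a baseline value. It is a conceptual illustration only: the function `shapley_values`, the toy scoring function, and the single all-zero baseline are assumptions, and the sketch is neither the study's stacking model nor the API of the `shap` library, which typically approximates these values against a background dataset.

```python
from itertools import combinations
from math import factorial
from typing import Callable, List, Sequence


def shapley_values(
    predict: Callable[[Sequence[float]], float],
    instance: Sequence[float],
    baseline: Sequence[float],
) -> List[float]:
    """Exact Shapley values of `predict` at `instance` relative to `baseline`."""
    n = len(instance)

    def value(subset: frozenset) -> float:
        # Features in `subset` take the instance value, the rest the baseline value.
        mixed = [instance[i] if i in subset else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_score(x: Sequence[float]) -> float:
    """Hypothetical class score: one additive term plus one interaction term."""
    return 2.0 * x[0] + x[1] * x[2]


instance = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_score, instance, baseline)
print(phi)
```

Here the additive feature receives its full effect (2.0) and the two interacting features split their joint effect equally (3.0 each), so the contributions sum to the difference between the prediction and the baseline output (8.0). Averaging the absolute values of such local contributions over many samples gives a global importance ranking of the kind described above.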

       
