Abstract
Accurate and efficient large-scale yield prediction for winter wheat is an important requirement in modern agriculture. In this study, a deep learning model based on multimodal data fusion was proposed to predict winter wheat yield at the county level. Wide-area input data were acquired from surface reflectance data and the Global Land Data Assimilation System (GLDAS). Multi-source data were integrated to enhance prediction accuracy, including surface reflectance (seven bands, such as red, blue, and short-wave infrared), meteorological data (eight bands, such as air temperature, near-surface wind speed, precipitation, and snow water equivalent), soil data (four bands, such as soil temperature and soil moisture), and historical yield data. The model consisted of four modules: a multimodal feature extraction encoder, a historical yield temporal feature extraction module, a dynamic gated feature fusion module (DGFM), and a yield prediction module. The multimodal feature extraction encoder extracted spatiotemporal feature representations from the input surface reflectance data and GLDAS data separately, using a dual-branch architecture. The encoding process was divided into two stages: spatial feature modeling and temporal feature modeling. In spatial feature modeling, ResNet was used as the backbone network, with its final fully connected layers removed to facilitate feature extraction. A Convolutional Block Attention Module (CBAM) was embedded after each residual block to suppress interference from background regions in the remote sensing images and to focus on key regions and channels. In temporal feature modeling, the feature vectors from the previous stage were reorganized sequentially by time step, and positional encoding was then introduced to preserve their temporal relationships.
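The temporal modeling stage described above orders the per-time-step feature vectors into a sequence and adds positional encoding. The abstract does not state which encoding was used, so the following is only a minimal sketch that assumes the common fixed sinusoidal form. The function names `sinusoidal_encoding` and `add_positions` are hypothetical and chosen for illustration.

```python
import math


def sinusoidal_encoding(num_steps, dim):
    """Return a num_steps x dim table of fixed sinusoidal position codes.

    Assumption: the sinusoidal form is used for illustration only; the
    source states only that a positional encoding is introduced.
    """
    table = []
    for pos in range(num_steps):
        row = []
        for i in range(dim):
            # Dimensions 2k and 2k+1 share one frequency.
            angle = pos / (10000 ** ((i - i % 2) / dim))
            # Even dimensions use sine, odd dimensions use cosine.
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def add_positions(sequence):
    """Add position codes to a time-ordered list of feature vectors."""
    codes = sinusoidal_encoding(len(sequence), len(sequence[0]))
    return [[x + p for x, p in zip(vec, code)]
            for vec, code in zip(sequence, codes)]
```

Because the codes depend only on the time-step index, two identical feature vectors observed at different growth stages become distinguishable to a subsequent attention layer.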
A five-layer cross-branch cross-attention block (CCAB) was used to capture the temporal dependencies at the key growth stages of winter wheat. A Gated Recurrent Unit (GRU) network was incorporated to extract temporal features from historical yields. The DGFM then adaptively fused the multi-source features, so that the fused representation more accurately reflected long-term trends and inter-annual fluctuations. A dataset covering the main winter wheat production areas of northern China was constructed to evaluate the model's performance, including Shanxi Province (Linfen, Yuncheng, Jincheng, Changzhi, and Jinzhong cities), Henan Province (all regions), Shaanxi Province (Guanzhong region), and Gansu Province (Tianshui, Pingliang, Qingyang, and Longnan cities). Experimental results showed that the coefficient of determination (R2) was 0.909, and the root mean squared error (RMSE) and mean absolute error (MAE) were 512.97 and 377.08 kg/km2, respectively. This accuracy was significantly better than that of conventional machine learning models. Compared with the high-performing CNN-LSTM-Attention model, the proposed model increased R2 by 6.69% and decreased RMSE and MAE by 18.28% and 15.51%, respectively. Ablation tests showed that the RMSE increased to 601.54 kg/km2 after the CBAM module was removed. After the DGFM and CCAB modules were removed, the RMSE increased by 75.21 and 80.18 kg/km2, the MAE increased by 9.41 and 14.37 kg/km2, and the R2 decreased by 0.038 and 0.033, respectively. All modules therefore contributed substantially to prediction accuracy. For early prediction, using data from sowing to the heading stage gave an RMSE of 531.95 kg/km2, slightly higher than the 512.97 kg/km2 obtained with data from the full growth period (a difference of about 3.6%). This suggests that accurate prediction is feasible approximately 50 days before the winter wheat harvest.
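The adaptive fusion step mentioned above can be illustrated with a generic gating scheme. The abstract gives no internal details of the DGFM, so the sketch below is not the authors' module: it assumes a simple element-wise sigmoid gate over two equal-length feature vectors. The names `gated_fusion`, `weights`, and `bias` are hypothetical, with `weights` and `bias` standing in for learned parameters.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(feat_a, feat_b, weights, bias):
    """Fuse two equal-length feature vectors with an element-wise gate.

    gate[i]  = sigmoid(weights[i] . concat(feat_a, feat_b) + bias[i])
    fused[i] = gate[i] * feat_a[i] + (1 - gate[i]) * feat_b[i]

    Assumption: a generic gating scheme for illustration, not the DGFM
    itself; weights and bias would be learned in a real model.
    """
    joint = feat_a + feat_b  # list concatenation of both inputs
    fused = []
    for i in range(len(feat_a)):
        score = sum(w * x for w, x in zip(weights[i], joint)) + bias[i]
        gate = sigmoid(score)
        fused.append(gate * feat_a[i] + (1.0 - gate) * feat_b[i])
    return fused
```

With all-zero parameters every gate equals 0.5 and the result is the plain average of the two inputs. Nonzero parameters let the gate shift weight toward one source depending on the input, which is the sense in which such a fusion is dynamic.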
The contribution analysis of different data sources showed that the surface reflectance data contributed strongly to winter wheat yield prediction, with RMSE, MAE, and R2 values of 623.15 kg/km2, 476.52 kg/km2, and 0.778, respectively, which indicates that surface reflectance data represent crop growth status more directly. In the annual validation from 2015 to 2021, the average RMSE and MAE were 561.34 kg/km2 and 398.31 kg/km2, respectively, and the average R2 was 0.876. Prediction performance fluctuated little between years, indicating inter-annual robustness. The proposed model thus achieved high accuracy and robustness in winter wheat yield prediction. These findings provide a promising approach for accurate and reliable crop yield forecasting.
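The three evaluation metrics reported throughout (RMSE, MAE, and R2) follow standard definitions, which can be sketched as below. This is an illustrative helper under those standard definitions, not the authors' evaluation code. The function name `regression_metrics` is hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (rmse, mae, r2) for observed and predicted yields.

    Assumption: standard textbook definitions; y_true must contain at
    least two distinct values so that the R2 denominator is nonzero.
    """
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)          # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return rmse, mae, r2
```

RMSE and MAE carry the unit of the yield values, while R2 is unitless. Annual averages such as those in the 2015 to 2021 validation would be obtained by applying the function to each year's counties and averaging the per-year results.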