Abstract:
This paper proposes a phenology prediction method that improves accuracy and generalization in soybean breeding populations. Feature disentanglement is integrated with the principle of "causal invariance": a causality-based weighted auto-encoder network (CWAE) is combined with phenology simulation to correct the prediction errors of the DSSAT-CROPGRO model. Taking simulated features such as the thermal and photoperiod variables as inputs, causal weights are calculated using the Markov boundary and the partial correlation coefficient. Low-redundancy latent features are then reconstructed via a weighted auto-encoding architecture and progressively transferred to the subsequent phenological stages. The optimized predictions were validated against field trial data covering four critical phenological stages: flowering (R1), pod beginning (R3), seed formation (R5), and maturity initiation (R7). The data were recorded for a diverse panel of 309 soybean genotypes from the Yangtze-Huai Soybean Breeding Line Population (YHSBLP) at the Yancheng ecological site from 2018 to 2020. The thermal and photoperiod effect variables enhanced feature discriminability. Redundancy among the latent features extracted by CWAE was 70.59% lower than among the raw inputs. Progressive feature transfer improved predictive performance at all stages, reducing the root mean square error (RMSE) of the phenology predictions by 13% to 19%. Compared with the standalone CROPGRO model, the fusion model greatly enhanced prediction accuracy: the RMSE fell from 4.59-5.98 days to 3.13-4.09 days, a decrease of 23.37% to 31.81%.
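The abstract does not spell out how the causal weights are derived from the partial correlation coefficient. As a minimal sketch, assume each simulated feature is weighted by the absolute value of its partial correlation with the observed phenology date, conditioned on all other features. The function names, the toy variables, and the normalization to unit sum are illustrative assumptions, not the authors' implementation. The Markov boundary selection step is omitted.

```python
import numpy as np

def partial_corr_with_target(X, y):
    """Partial correlation of each column of X with y, given the remaining columns."""
    data = np.column_stack([X, y])
    # Precision matrix: (pseudo-)inverse of the sample covariance matrix.
    precision = np.linalg.pinv(np.cov(data, rowvar=False))
    d = np.sqrt(np.diag(precision))
    # Standard identity: rho_ij = -P_ij / sqrt(P_ii * P_jj)
    pcorr = -precision / np.outer(d, d)
    # Last column holds the feature-vs-target entries.
    return pcorr[:-1, -1]

def causal_weights(X, y):
    """Hypothetical weighting: absolute partial correlations normalized to sum to 1."""
    r = np.abs(partial_corr_with_target(X, y))
    return r / r.sum()

# Synthetic illustration only: two informative features and one irrelevant one.
rng = np.random.default_rng(0)
n = 500
thermal = rng.normal(size=n)      # stand-in for a thermal effect variable
photo = rng.normal(size=n)        # stand-in for a photoperiod effect variable
irrelevant = rng.normal(size=n)   # feature unrelated to the target
y = 2.0 * thermal + 1.0 * photo + 0.1 * rng.normal(size=n)

X = np.column_stack([thermal, photo, irrelevant])
w = causal_weights(X, y)
print(w)
```

In this toy setting the irrelevant feature receives a weight close to zero, so a weighted auto-encoder using such weights would emphasize the features that retain a direct association with the target.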
The fusion model also generalized well when deployed in a new environment, particularly in 2021, a year with distribution shifts. The population-level RMSE decreased by 23.53% to 71.01%, and the genotype-level average RMSE fell from 5.45-12.41 days to 2.80-5.00 days. Notably, effective calibration was achieved for over 80% of the genotypes in the population under this challenging cross-environment validation. By extracting highly relevant, low-redundancy, and robust latent features from crop simulations, the CWAE framework significantly enhances phenology prediction accuracy and generalization capability in soybean breeding populations. The framework provides an effective and practical approach for analyzing the phenotypic responses of soybean breeding materials to varying temperature and photoperiod conditions, based on enhanced crop growth simulations.
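The two evaluation levels mentioned above can be told apart with a short sketch. It assumes that the population-level RMSE pools all observations, while the genotype-level average RMSE is the mean of the RMSE values computed separately for each genotype. The function names and the toy numbers (days after sowing) are hypothetical and are not taken from the study.

```python
import numpy as np

def rmse(obs, pred):
    """Root mean square error between observed and predicted dates (days)."""
    obs = np.asarray(obs, dtype=float)
    pred = np.asarray(pred, dtype=float)
    return float(np.sqrt(np.mean((obs - pred) ** 2)))

def genotype_mean_rmse(genotypes, obs, pred):
    """Average of per-genotype RMSE values (assumed definition)."""
    genotypes = np.asarray(genotypes)
    obs = np.asarray(obs, dtype=float)
    pred = np.asarray(pred, dtype=float)
    per_genotype = [
        rmse(obs[genotypes == g], pred[genotypes == g])
        for g in np.unique(genotypes)
    ]
    return float(np.mean(per_genotype))

def relative_reduction(before, after):
    """Percentage decrease of an error metric relative to its baseline value."""
    return 100.0 * (before - after) / before

# Toy data: two genotypes, two observations each (illustrative values only).
genotypes = ["G1", "G1", "G2", "G2"]
observed = [40, 42, 50, 52]
baseline = [44, 46, 53, 49]    # stand-in for standalone simulation output
corrected = [41, 43, 51, 51]   # stand-in for fusion-model output

pop_before = rmse(observed, baseline)
pop_after = rmse(observed, corrected)
geno_before = genotype_mean_rmse(genotypes, observed, baseline)
geno_after = genotype_mean_rmse(genotypes, observed, corrected)

print(relative_reduction(pop_before, pop_after))
print(geno_before, geno_after)
```

The two levels can differ because the pooled RMSE is dominated by genotypes with large errors, whereas the genotype-level average gives every genotype equal weight.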