Abstract:
Soil total nitrogen (TN) content is a critical indicator for assessing soil nutrient status. However, under small sample size conditions, the accuracy of inverting soil TN content using visible and near-infrared (Vis-NIR) spectroscopy is often unsatisfactory. To address this challenge, this study proposes a novel data augmentation framework based on generative adversarial networks (GANs). Specifically, to improve the quality of the generated spectral data, a conditional generative adversarial network (CGAN) architecture is employed, which guides the generation process through relevant auxiliary information. The study evaluates three types of generative adversarial networks: a standard GAN, a label-conditional generative adversarial network (LCGAN) that uses soil TN content values as conditional labels, and a VIP-CGAN that uses, as conditional vectors, feature wavelength sets selected by applying the extremum method to variable importance in projection (VIP) scores. Building on this, the feature wavelength selection method is refined by using the extremum method to identify appropriate extreme points on the VIP score curve and extending outward from them, thereby selecting feature wavelengths that are less affected by noise and external environmental interference. This approach is shown to be more effective than selecting wavelengths by high VIP scores alone. The experimental dataset consists of Vis-NIR spectra collected in situ from agricultural soils and the corresponding laboratory-measured TN content. Using an evaluation method that combines qualitative and quantitative assessments, the fidelity of synthetic samples generated by the standard GAN, LCGAN, and various configurations of VIP-CGAN is compared and analyzed. Quantitative evaluation results show that the VIP-CGAN variant constructed from 9 extended feature wavelength bands (referred to as VIP-CGAN(T9)) performs best.
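The extremum-based selection described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the study's implementation: the abstract does not give the peak criterion, the extension width, or any VIP threshold, so the function name `select_bands_by_vip_extrema` and the parameters `n_peaks`, `half_width`, and `min_vip` are hypothetical. The sketch finds local maxima of a VIP curve, keeps the strongest ones, and extends each outward by a fixed number of bands.

```python
import numpy as np

def select_bands_by_vip_extrema(vip, n_peaks=9, half_width=5, min_vip=1.0):
    """Return sorted band indices around the strongest local maxima of a VIP curve.

    All parameter values are illustrative assumptions, not taken from the study.
    """
    vip = np.asarray(vip, dtype=float)
    interior = np.arange(1, vip.size - 1)
    # A local maximum rises from its left neighbour and does not rise to its right.
    is_peak = (vip[interior] > vip[interior - 1]) & (vip[interior] >= vip[interior + 1])
    # Assumed filter: only keep peaks whose VIP score reaches min_vip.
    peaks = interior[is_peak & (vip[interior] >= min_vip)]
    # Keep the n_peaks highest peaks.
    top = peaks[np.argsort(vip[peaks])[::-1][:n_peaks]]
    selected = set()
    for p in top:
        lo = max(0, p - half_width)             # extend outward to the left
        hi = min(vip.size - 1, p + half_width)  # extend outward to the right
        selected.update(range(lo, hi + 1))
    return np.array(sorted(selected), dtype=int)

# Toy VIP curve with two clear peaks and one weak peak below the threshold.
toy_vip = np.zeros(50)
toy_vip[10], toy_vip[30], toy_vip[40] = 2.0, 1.5, 0.5
bands = select_bands_by_vip_extrema(toy_vip, n_peaks=2, half_width=2)
```

The default `n_peaks=9` merely echoes the "T9" label; how the study actually counts and extends its bands is not stated in the abstract. The resulting index set could then be used to build a conditional vector for the generator.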
The generated samples achieve maximum mean discrepancy (MMD) and Fréchet inception distance (FID) scores as low as 0.003 and 0.005, respectively. These values indicate high statistical consistency between the generated data and the original data distribution, supporting the model's ability to learn the relationship between the conditioning constraints and the spectral features and to generate realistic and reliable synthetic spectra. To evaluate the effect of data augmentation, an augmented dataset is constructed by combining real samples with synthetic samples generated by VIP-CGAN. The predictive performance of three regression models, namely partial least squares regression (PLSR), support vector regression (SVR), and a one-dimensional convolutional neural network (1D-CNN), is systematically tested. When synthetic samples generated by VIP-CGAN(T9) are added at a proportion of 300% (three times the size of the original training set), all three models achieve their best performance. The PLSR model attains a coefficient of determination (R²) of 0.86 with a root mean square error (RMSE) of 0.028 g/kg; the SVR model achieves an R² of 0.84 and an RMSE of 0.009 g/kg; and the 1D-CNN model performs best, with an R² of 0.88 and an RMSE of 0.026 g/kg. These results represent a substantial improvement over baseline models trained solely on the original limited dataset. By conditioning the generator on VIP-selected wavelengths, the model is guided to focus on spectral regions associated with chemical information related to nitrogen compounds and organic matter, forming a physically meaningful constraint mechanism. The resulting augmented data creates a more robust training environment for regression models, effectively mitigating overfitting and improving generalization. In conclusion, this study establishes an effective framework for enhancing the hyperspectral inversion accuracy of soil TN content under small sample size conditions. The proposed method provides a solution to the challenge of small samples in soil Vis-NIR spectroscopy analysis, with potential future applications in analyzing other soil properties and exploring more advanced generative architectures.
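For readers unfamiliar with the MMD score reported above, the following is a minimal sketch of how such a score might be computed between real and synthetic spectra. The abstract does not specify the kernel, bandwidth, or estimator, so everything here is an assumption: a Gaussian (RBF) kernel, a median-heuristic bandwidth, and the biased estimator of the squared MMD. The names `mmd2_rbf` and `_sq_dists` are hypothetical, and values from this sketch are not directly comparable to the reported 0.003.

```python
import numpy as np

def _sq_dists(a, b):
    """Pairwise squared Euclidean distances between rows of a and rows of b."""
    sq = (a**2).sum(axis=1)[:, None] + (b**2).sum(axis=1)[None, :] - 2.0 * (a @ b.T)
    return np.maximum(sq, 0.0)  # clip tiny negatives caused by rounding

def mmd2_rbf(real, synth, gamma=None):
    """Biased estimate of squared MMD with kernel exp(-gamma * ||x - y||^2).

    real, synth: arrays of shape (n_samples, n_bands).
    """
    real = np.asarray(real, dtype=float)
    synth = np.asarray(synth, dtype=float)
    if gamma is None:
        # Assumed bandwidth choice: median heuristic over the pooled samples.
        pooled = np.vstack([real, synth])
        d = _sq_dists(pooled, pooled)
        positive = d[d > 0]
        gamma = 1.0 / np.median(positive) if positive.size else 1.0
    k_rr = np.exp(-gamma * _sq_dists(real, real))    # real vs real
    k_ss = np.exp(-gamma * _sq_dists(synth, synth))  # synthetic vs synthetic
    k_rs = np.exp(-gamma * _sq_dists(real, synth))   # real vs synthetic
    return float(k_rr.mean() + k_ss.mean() - 2.0 * k_rs.mean())

# Toy check on random data standing in for spectra (not real measurements).
rng = np.random.default_rng(0)
x = rng.normal(size=(40, 5))
same = mmd2_rbf(x, x.copy())   # identical sets
shifted = mmd2_rbf(x, x + 3.0) # clearly different distribution
```

Under these assumptions, a score near zero means the two sample sets are hard to distinguish in the kernel's feature space, while larger values indicate a distribution mismatch.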