Abstract
Accurate classification of maize drought phenotypes from in-field visible-light imagery is essential for irrigation decision support and yield-loss mitigation, yet the task remains challenging under practical farmland conditions, where scale variation, background clutter, and illumination fluctuations jointly degrade feature stability and inter-class separability. To address these difficulties, a multi-source maize drought image dataset was established and an improved Swin Transformer–based framework was developed for nine-class phenotype classification across multiple growth stages and drought severities. The original collection contained 948 images acquired from three channels: fixed high-definition cameras deployed at agricultural research stations, in-field smartphone photography, and publicly accessible online repositories. After strict manual screening, 132 low-quality or redundant images (13.9% of the originals) were removed, leaving 816 high-quality samples. To enhance robustness to acquisition variability, data augmentation operations (random rotations, salt-and-pepper noise injection, and adaptive histogram equalization) were applied, expanding the dataset to 3,628 images. Each image was annotated with one of nine category labels defined by the intersection of three drought severity levels (no drought, mild drought, severe drought) and three growth stages (jointing, tasseling, maturity); the final dataset was split into training, validation, and test subsets in a 7:1:2 ratio (2,542/363/723). Swin Transformer, a hierarchical Vision Transformer variant with shifted-window self-attention, was adopted as the backbone to balance contextual modeling and computational efficiency for high-resolution field images. Building upon this baseline, three complementary modifications were incorporated at the Transformer block level: 1) residual post-normalization, which places layer normalization after the residual connection to stabilize feature scaling in deep layers; 2) scaled cosine attention, which replaces dot-product similarity with cosine similarity and a learnable scaling factor to reduce sensitivity to activation magnitude and improve similarity measurement under varying illumination; and 3) log-spaced continuous position bias (Log-CPB), which generates continuous relative position biases from logarithmically mapped coordinates to facilitate transfer across different window sizes and image resolutions. In addition, a multi-scale dilated fusion attention (MDFA) module was designed to strengthen drought-relevant representation by integrating four parallel dilated convolution branches (dilation rates 1, 6, 12, and 18) and fusing channel attention with spatial attention to emphasize informative channels and locations associated with leaf curling, chlorosis/yellowing, and drying symptoms. The resulting model was termed Swin Transformer with multi-scale dilated fusion attention (SWT-MDFA). Ablation experiments quantified the contribution of each component on the test set: the baseline Swin Transformer achieved 95.9% accuracy, 95.4% precision, and 96.2% recall; residual post-normalization increased accuracy to 96.1%; scaled cosine attention improved accuracy to 96.2% and recall to 96.6%; Log-CPB chiefly improved recall, raising it to 96.8% under complex backgrounds; and MDFA produced the largest single-module gain, reaching 97.0% accuracy with 96.3% precision and 97.2% recall.
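As an illustration of the augmentation pipeline named above (random rotation, salt-and-pepper noise, adaptive histogram equalization), a minimal OpenCV/NumPy sketch is given below. The rotation range, noise ratio, CLAHE clip limit, and the choice of equalizing only the luminance channel are assumptions for illustration; the abstract does not specify the exact settings used in this work.

```python
import cv2
import numpy as np

def augment(image, angle=15.0, sp_ratio=0.01, clip_limit=2.0):
    """Sketch of the three augmentation operations: random rotation,
    salt-and-pepper noise injection, and adaptive histogram equalization
    (CLAHE on the luminance channel). Expects a uint8 BGR image."""
    h, w = image.shape[:2]
    # Random rotation within +/- angle degrees about the image centre.
    theta = np.random.uniform(-angle, angle)
    M = cv2.getRotationMatrix2D((w / 2, h / 2), theta, 1.0)
    rotated = cv2.warpAffine(image, M, (w, h), borderMode=cv2.BORDER_REFLECT)
    # Salt-and-pepper noise on a random subset of pixels.
    noisy = rotated.copy()
    mask = np.random.rand(h, w)
    noisy[mask < sp_ratio / 2] = 0
    noisy[mask > 1 - sp_ratio / 2] = 255
    # Contrast-limited adaptive histogram equalization on the L channel.
    lab = cv2.cvtColor(noisy, cv2.COLOR_BGR2LAB)
    clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=(8, 8))
    lab[:, :, 0] = clahe.apply(lab[:, :, 0])
    return cv2.cvtColor(lab, cv2.COLOR_LAB2BGR)
```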
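For readers implementing the block-level changes described above, a minimal PyTorch-style sketch of scaled cosine attention combined with Log-CPB follows. The module name, the two-layer meta-network with 512 hidden units, the logit-scale initialization, and the coordinate normalization constant are illustrative assumptions rather than the exact configuration of this work.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaledCosineAttentionLogCPB(nn.Module):
    """Window self-attention sketch: cosine similarity with a learnable scale,
    plus a log-spaced continuous relative position bias (Log-CPB)."""

    def __init__(self, dim, window_size, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, dim * 3, bias=True)
        self.proj = nn.Linear(dim, dim)
        # Learnable per-head logit scale for the cosine-similarity attention map.
        self.logit_scale = nn.Parameter(torch.log(10.0 * torch.ones(num_heads, 1, 1)))
        # Small meta-network mapping log-spaced relative coordinates to per-head biases.
        self.cpb_mlp = nn.Sequential(nn.Linear(2, 512), nn.ReLU(inplace=True),
                                     nn.Linear(512, num_heads, bias=False))
        # Log-spaced relative coordinates for a (Wh, Ww) window.
        Wh, Ww = window_size
        coords = torch.stack(torch.meshgrid(torch.arange(Wh), torch.arange(Ww),
                                            indexing="ij")).flatten(1)            # 2, N
        rel = (coords[:, :, None] - coords[:, None, :]).permute(1, 2, 0).float()  # N, N, 2
        rel = torch.sign(rel) * torch.log2(rel.abs() + 1.0) / math.log2(8)
        self.register_buffer("log_rel_coords", rel)

    def forward(self, x):                                   # x: (B, N, C) window tokens
        B, N, C = x.shape
        qkv = (self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
                          .permute(2, 0, 3, 1, 4))
        q, k, v = qkv[0], qkv[1], qkv[2]
        # Cosine similarity between queries and keys, scaled by a learnable factor.
        attn = F.normalize(q, dim=-1) @ F.normalize(k, dim=-1).transpose(-2, -1)
        attn = attn * torch.clamp(self.logit_scale, max=math.log(100.0)).exp()
        # Continuous position bias generated from the log-spaced coordinates.
        bias = self.cpb_mlp(self.log_rel_coords).permute(2, 0, 1)                 # heads, N, N
        attn = (attn + bias.unsqueeze(0)).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```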
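Similarly, the sketch below shows one plausible realization of the MDFA module: four parallel dilated 3×3 convolutions (rates 1, 6, 12, 18) whose concatenated output is fused and then reweighted by squeeze-and-excitation-style channel attention and a CBAM-style spatial attention. The branch widths, reduction ratio, fusion order, and residual connection are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class MDFA(nn.Module):
    """Sketch of a multi-scale dilated fusion attention block: four parallel
    dilated 3x3 convolutions followed by channel and spatial attention
    applied to the fused feature map."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True))
            for r in (1, 6, 12, 18)
        ])
        self.fuse = nn.Conv2d(4 * channels, channels, 1, bias=False)
        # Channel attention: global average pooling + bottleneck MLP.
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid())
        # Spatial attention: 7x7 convolution over pooled channel statistics.
        self.spatial_att = nn.Sequential(
            nn.Conv2d(2, 1, 7, padding=3),
            nn.Sigmoid())

    def forward(self, x):
        multi_scale = torch.cat([b(x) for b in self.branches], dim=1)
        fused = self.fuse(multi_scale)
        fused = fused * self.channel_att(fused)
        pooled = torch.cat([fused.mean(dim=1, keepdim=True),
                            fused.max(dim=1, keepdim=True).values], dim=1)
        fused = fused * self.spatial_att(pooled)
        return fused + x  # residual connection keeps the backbone features
```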
With all components integrated, SWT-MDFA achieved the best overall performance, attaining 97.4% accuracy, 97.4% precision, and 97.3% recall, corresponding to improvements of 1.5, 2.0, and 1.1 percentage points over the baseline, respectively. Comparative evaluation against representative convolutional backbones confirmed the advantage of SWT-MDFA, whose accuracy exceeded that of Res2Net-50 (95.0%), Xception (93.6%), SE-ResNet (92.4%), and DenseNet-121 (91.2%). In terms of computational cost, SWT-MDFA required 88.9 million parameters and 48.9 giga floating-point operations (GFLOPs), indicating that the accuracy gains were achieved within a manageable computational budget. Confusion-matrix analysis on the 723-image test set (704 images correctly classified, 97.4% overall accuracy) showed that misclassifications were concentrated mainly between adjacent drought severities or neighboring growth stages, whereas classes with more distinctive phenotypes were recognized without error. Visualization with gradient-weighted class activation mapping (Grad-CAM) further suggested that the model consistently focused on drought-relevant leaf regions while suppressing irrelevant background responses, supporting interpretable and reliable phenotype classification across growth stages and severity levels.
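As a small, self-contained illustration of how the reported test-set figures relate to a confusion matrix, the sketch below computes overall accuracy and macro-averaged precision and recall from a square matrix with true classes in rows and predictions in columns. The function name and the macro-averaging choice are assumptions, and the actual per-class counts are not reproduced here; a 9-class matrix with 704 correct predictions out of 723 would yield an overall accuracy of about 0.974.

```python
import numpy as np

def confusion_matrix_metrics(cm):
    """Overall accuracy plus macro-averaged precision and recall
    from a square confusion matrix (rows: true class, columns: predicted)."""
    cm = np.asarray(cm, dtype=float)
    accuracy = np.trace(cm) / cm.sum()
    # Per-class precision = TP / column sum; recall = TP / row sum.
    with np.errstate(divide="ignore", invalid="ignore"):
        precision = np.diag(cm) / cm.sum(axis=0)
        recall = np.diag(cm) / cm.sum(axis=1)
    return accuracy, np.nanmean(precision), np.nanmean(recall)
```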