Abstract:
Soybean is an important economic crop and a major source of plant protein for diverse populations worldwide, so screening for high-yielding, high-quality varieties has long been a priority in crop breeding research. Phenotypic traits of soybean plants are closely related to yield, but traditional measurement methods suffer from inherent limitations such as high subjectivity, heavy labor demands, and susceptibility to error. Existing convolutional neural network methods perform poorly on pod morphology and seed counting in particular, while detection of the main stem and branches is easily disturbed by occlusion and stem curvature, resulting in low measurement accuracy and hindering widespread adoption in actual production. To address these issues, this study introduces an enhanced real-time detection transformer (RT-DETR) algorithm to improve the detection accuracy of soybean phenotypic traits. For pod detection, an attention-scale sequence fusion (ASF) module is integrated into the Transformer architecture of RT-DETR; through multi-scale feature fusion and a dual attention mechanism, the model's target recognition performance in complex environments is significantly enhanced.
This module comprises three core parts: the Scale Sequential Feature Fusion (SSFF) module uses 3D convolution and upsampling to fuse multi-scale feature maps from layers P3, P4, and P5, extracting scale-invariant features so that the network can simultaneously detect pods of different sizes (large, medium, and small); the Three-Scale Feature Encoding (TFE) module scales features from the three levels to a common resolution before concatenation, ensuring that the fused features carry both detail and context and improving the representation of dense, overlapping, and small pods; and the Channel and Position Attention (CPAM) module applies channel attention to select highly discriminative feature channels and spatial attention to focus on the target region while suppressing background interference, yielding more accurate localization and classification. During training, the ASF module provides richer gradient information, improving convergence speed and stability and thereby enhancing recognition of pod phenotypic features against complex backgrounds. For the task of detecting main stems and branches, this study designed a Wavelet Feature Upgrade (WFU) module. This module performs a wavelet-based multi-scale decomposition of the image and integrates the high-frequency and low-frequency features separately in the decoder, effectively exploiting multi-scale information. This not only strengthens the network's learning of key features but also reduces distortion during image analysis, improving the model's sensitivity to target shape and boundaries.
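The channel-then-position attention idea behind CPAM can be illustrated with a minimal numpy sketch: gate feature channels by their global response, then gate spatial positions by a channel-averaged saliency map. Function names, shapes, and the simple sigmoid gating are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def channel_attention(x):
    # x: (C, H, W). Global average pool over space, then gate each channel.
    w = x.mean(axis=(1, 2))            # per-channel response -> (C,)
    w = 1.0 / (1.0 + np.exp(-w))       # sigmoid gate in [0, 1]
    return x * w[:, None, None]

def position_attention(x):
    # Average over channels to get a spatial saliency map, then gate positions.
    m = x.mean(axis=0)                 # (H, W)
    m = 1.0 / (1.0 + np.exp(-m))       # sigmoid gate in [0, 1]
    return x * m[None, :, :]

def cpam_like(x):
    # Channel attention first, then spatial (position) attention, mirroring
    # "select discriminative channels, then focus on the target region".
    return position_attention(channel_attention(x))

feat = np.random.randn(8, 16, 16)
out = cpam_like(feat)
print(out.shape)   # (8, 16, 16)
```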
Compared with traditional convolution operations, the WFU module constructs a novel feature enhancement network: a two-dimensional wavelet transform decomposes the image into low-frequency (background) and high-frequency (target edge) components, which are routed to two dedicated branches, a MobileNet branch (large convolutional kernels and LayerNorm for background suppression) and a ConvNeXt branch (small convolutional kernels and ReLU activation to enhance details); an inverse wavelet transform then decouples background from target. The method first concatenates encoder-layer features with the decoder's upsampled features and splits them into two paths: one extracts deep fine-grained semantic information through a lightweight inverted residual structure, while the other preserves spatial details. After residual summation, a cascaded inverted residual structure significantly reduces the false negative rate for fragmented and elongated targets. During upsampling, the dual-path architecture processes in parallel: one path uses a 7×7 depthwise separable convolution with a two-layer FC-GELU activation for long-range spatial compensation, while the other uses transposed convolution with a 3×3 DWConv for resolution restoration. The residual-fused output supplements high-frequency boundary information while mitigating the mesh artifacts that transposed convolution may introduce, generating clear, coherent target edges with sub-pixel accuracy without significantly increasing the parameter count. Experimental results show that the improved RT-DETR algorithm achieves an accuracy of 91.1% in soybean pod detection and 94.0% in main stem and branch detection. Furthermore, morphological parameters of the main stem and branches were extracted using OpenCV.
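The wavelet split described above can be sketched with a single-level 2D Haar transform: the image is separated into a low-frequency approximation (background) and three high-frequency detail bands (edges), and an inverse transform recombines them exactly. This is a generic Haar example, not the WFU module itself.

```python
import numpy as np

def haar2d(img):
    # Single-level 2D Haar transform on an even-sized image: one
    # low-frequency band (LL) plus three high-frequency detail bands
    # (LH, HL, HH) carrying horizontal/vertical/diagonal edge energy.
    a = img[0::2, 0::2]; b = img[0::2, 1::2]
    c = img[1::2, 0::2]; d = img[1::2, 1::2]
    ll = (a + b + c + d) / 2.0
    lh = (a - b + c - d) / 2.0
    hl = (a + b - c - d) / 2.0
    hh = (a - b - c + d) / 2.0
    return ll, lh, hl, hh

def ihaar2d(ll, lh, hl, hh):
    # Inverse transform: recombine the four half-resolution bands
    # into the full-resolution image (lossless for Haar).
    h, w = ll.shape
    out = np.empty((2 * h, 2 * w))
    out[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    out[0::2, 1::2] = (ll - lh + hl - hh) / 2.0
    out[1::2, 0::2] = (ll + lh - hl - hh) / 2.0
    out[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return out

img = np.arange(64, dtype=float).reshape(8, 8)
ll, lh, hl, hh = haar2d(img)
rec = ihaar2d(ll, lh, hl, hh)
print(np.allclose(rec, img))   # True
```

In a WFU-style design, the LL band would feed the background-suppression branch and the detail bands the edge-enhancement branch before the inverse transform merges them.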
Based on the obtained phenotypic features such as pod number, seed number, and main stem/branch area, a voting regression ensemble model was constructed, which accurately predicted the weight per plant (R² = 0.90), thereby enabling yield estimation. The soybean phenotypic analysis and yield prediction method proposed in this study provides reliable technical support for soybean breeding and cultivation optimization, and also offers a new technical approach for crop phenomics research.
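A voting regression ensemble of the kind used for per-plant weight prediction averages the outputs of several base regressors. The sketch below uses two simple members (closed-form ridge regression and k-nearest-neighbour regression) with equal-weight averaging; the feature layout and example values are hypothetical, and the paper's actual ensemble members are not specified here.

```python
import numpy as np

def fit_ridge(X, y, lam=1e-3):
    # Closed-form ridge regression on [X | 1] (bias column appended).
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.linalg.solve(Xb.T @ Xb + lam * np.eye(Xb.shape[1]), Xb.T @ y)
    return lambda Xn: np.hstack([Xn, np.ones((len(Xn), 1))]) @ w

def fit_knn(X, y, k=3):
    # k-nearest-neighbour regressor: average targets of the k closest rows.
    def predict(Xn):
        d = ((Xn[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        idx = np.argsort(d, axis=1)[:, :k]
        return y[idx].mean(axis=1)
    return predict

def voting_predict(models, Xn):
    # Equal-weight vote: ensemble prediction is the mean of members' outputs.
    return np.mean([m(Xn) for m in models], axis=0)

# Hypothetical per-plant features: [pod count, seed count, stem+branch area].
X = np.array([[30., 72., 410.], [25., 60., 380.], [40., 95., 520.],
              [18., 40., 300.], [35., 85., 470.], [22., 50., 340.]])
y = np.array([14.2, 11.8, 19.0, 8.5, 17.1, 10.3])   # weight per plant (g)

models = [fit_ridge(X, y), fit_knn(X, y, k=2)]
print(voting_predict(models, X[:2]).shape)   # (2,)
```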