Abstract:
Soybean is one of the most important economic crops worldwide, particularly as a major source of plant protein. Crop breeding often requires screening for high-yielding, high-quality varieties, and the phenotypic traits of soybean plants are closely related to yield. However, conventional measurement is highly subjective, labor-intensive, and error-prone, which has hindered its widespread application in actual production. In particular, existing convolutional neural networks (CNNs) cannot fully meet the requirements of pod morphology analysis and seed counting, while detection of the main stem and branches is easily affected by occlusion and curvature interference, resulting in low measurement accuracy. In this study, an enhanced real-time detection transformer (RT-DETR) algorithm was introduced to improve the detection accuracy of soybean phenotypic traits. For pod detection, an attention-scale sequence fusion (ASF) module was integrated into the Transformer architecture of RT-DETR, where multi-scale feature fusion and a dual attention mechanism significantly enhanced target recognition in complex environments. The module was divided into three core parts: (1) The Scale Sequential Feature Fusion (SSFF) module fused the multi-scale feature maps from layers P3, P4, and P5 using 3D convolution and upsampling, thereby extracting scale-invariant features to simultaneously detect pods of different sizes (large, medium, and small); (2) The Three-Scale Feature Encoding (TFE) module uniformly resized the features from the three scales to the same resolution before concatenation, so that the fused features contained both detailed and contextual information and improved the representation of dense, overlapping, and small pods.
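The SSFF fusion step can be illustrated with a minimal NumPy sketch (the channel count, stride ratios, and nearest-neighbor upsampling below are illustrative assumptions, not the exact implementation): resized P4 and P5 maps are stacked with P3 along a new scale axis, yielding the 3D volume that a 3D convolution would then process.

```python
import numpy as np

def upsample_nn(x, factor):
    """Nearest-neighbor upsampling of a (C, H, W) feature map."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def ssff_stack(p3, p4, p5):
    """Resize P4/P5 to P3's resolution and stack the three maps along
    a new scale axis, giving a (C, 3, H, W) volume for 3D convolution."""
    p4_up = upsample_nn(p4, 2)
    p5_up = upsample_nn(p5, 4)
    return np.stack([p3, p4_up, p5_up], axis=1)

# Toy feature maps: C=8 channels at strides 8/16/32 of a 640px input
p3 = np.random.rand(8, 80, 80)
p4 = np.random.rand(8, 40, 40)
p5 = np.random.rand(8, 20, 20)
vol = ssff_stack(p3, p4, p5)
print(vol.shape)  # (8, 3, 80, 80)
```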
(3) The Channel and Position Attention Module (CPAM) selected highly discriminative feature channels and applied spatial attention to focus on the target region and suppress background interference, thus achieving more accurate localization and classification. The ASF module also provided richer gradient information during training, enhancing convergence speed and stability when recognizing pod phenotypic features in complex backgrounds. For main stem and branch detection, a Wavelet Feature Upgrade (WFU) module was designed. The wavelet transform performed multi-scale decomposition of the image, and the decoder integrated the high- and low-frequency features to make effective use of multi-scale information; the key features were learned to reduce distortion after image analysis, improving sensitivity to target shape and boundaries. Compared with conventional convolution, the WFU module enhanced the feature network: two-dimensional wavelets decomposed the image into low-frequency (background) and high-frequency (target edge) components, which were routed into two branches, a MobileNet branch (large convolutional kernels and LayerNorm for background suppression) and a ConvNeXt branch (small convolutional kernels and ReLU activation to enhance details), after which the inverse wavelet transform decoupled background from target. The encoder-layer features were first concatenated with the upsampled decoder features and then split into two paths: one extracted deep fine-grained semantic information using a lightweight inverted residual structure, while the other preserved spatial details. After residual summation, a cascaded inverted residual structure significantly reduced the false-negative rate for fragmented and elongated targets.
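The wavelet decomposition underlying the WFU module can be sketched with a one-level 2D Haar transform in NumPy (the Haar basis is an assumption for illustration; the abstract does not specify the wavelet): the LL sub-band carries the low-frequency background, the LH/HL/HH sub-bands carry high-frequency edge detail, and the inverse transform recombines them losslessly.

```python
import numpy as np

def haar2d(img):
    """One-level 2D Haar transform: split an (H, W) image (H, W even)
    into a low-frequency sub-band and three high-frequency sub-bands."""
    a = img[0::2, 0::2]; b = img[0::2, 1::2]
    c = img[1::2, 0::2]; d = img[1::2, 1::2]
    ll = (a + b + c + d) / 4  # background / coarse structure
    lh = (a - b + c - d) / 4  # horizontal detail
    hl = (a + b - c - d) / 4  # vertical detail
    hh = (a - b - c + d) / 4  # diagonal detail (edges)
    return ll, lh, hl, hh

def ihaar2d(ll, lh, hl, hh):
    """Inverse transform: recombine the sub-bands into the full image."""
    h, w = ll.shape
    out = np.empty((2 * h, 2 * w))
    out[0::2, 0::2] = ll + lh + hl + hh
    out[0::2, 1::2] = ll - lh + hl - hh
    out[1::2, 0::2] = ll + lh - hl - hh
    out[1::2, 1::2] = ll - lh - hl + hh
    return out

img = np.random.rand(8, 8)
ll, lh, hl, hh = haar2d(img)
recon = ihaar2d(ll, lh, hl, hh)  # exact reconstruction
```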
During upsampling, a dual-path architecture processed features in parallel: one path used a 7×7 depthwise separable convolution with a two-layer FC-GELU block for long-range spatial compensation, while the other used a transposed convolution followed by a 3×3 depthwise convolution (DWConv) for resolution restoration. The output of the residual fusion supplemented high-frequency boundary information while avoiding the checkerboard (mesh) artifacts of transposed convolution, so coherent target edges were generated with sub-pixel accuracy and few parameters. Experimental results show that the improved RT-DETR algorithm achieved an accuracy of 0.911 for soybean pod detection and 0.940 for main stem and branch detection. Furthermore, the morphological parameters of the main stem and branches were extracted using OpenCV. Based on phenotypic features such as pod number, seed number, and main stem/branch area, a voting regression ensemble model was constructed to accurately predict the weight per plant (R² = 0.90), thereby realizing yield estimation. This soybean phenotypic analysis and yield prediction can provide reliable technical support for soybean breeding and cultivation optimization, and offer a technical approach for crop phenomics.
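The voting-regression idea can be sketched in plain NumPy (the two base regressors below, ordinary least squares and a ridge variant, and the synthetic data are hypothetical stand-ins for the actual ensemble): each base model predicts plant weight from the phenotypic features, and the ensemble averages ("votes") their predictions.

```python
import numpy as np

def fit_linear(X, y, ridge=0.0):
    """Least-squares fit (optional ridge penalty) on a [X, 1] design
    matrix; returns the weight vector including the bias term."""
    A = np.hstack([X, np.ones((X.shape[0], 1))])
    return np.linalg.solve(A.T @ A + ridge * np.eye(A.shape[1]), A.T @ y)

def predict(X, w):
    return np.hstack([X, np.ones((X.shape[0], 1))]) @ w

def r2_score(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

# Toy data: columns = pod number, seed number, main stem/branch area
rng = np.random.default_rng(0)
X = rng.uniform(10, 100, size=(60, 3))
y = 0.3 * X[:, 0] + 0.15 * X[:, 1] + 0.05 * X[:, 2] + rng.normal(0, 1, 60)

models = [fit_linear(X, y, ridge=r) for r in (0.0, 1.0)]
y_vote = np.mean([predict(X, w) for w in models], axis=0)  # averaged vote
print(round(r2_score(y, y_vote), 3))
```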