Abstract:
Accurate detection of small-volume targets, such as tree crowns and pedestrians, remains a significant challenge for orchard robots in complex nursery environments, where high detection performance is essential. In this study, an enhanced version of the VoteNet model was proposed for 3D point cloud object detection. Three major modifications were made to the original VoteNet architecture. Firstly, the original asymmetric feature extraction module was replaced by a symmetric feature extraction module (SFEM), which adopts a U-Net-like encoder-decoder structure to integrate multi-scale features. The symmetric design fuses low-level spatial information with high-level semantic information more effectively, preserving the fine geometric features needed to detect small objects. Secondly, a reverse attention feature fusion module (RAFFM) was introduced at the skip connections. It enhances local feature representation through self-attention and trilinear interpolation (an illustrative sketch of the interpolation step follows the abstract), highlighting fine-grained structures, improving feature consistency across scales, and focusing the network on the more discriminative regions of small targets. Thirdly, a center point discrepancy loss (CPD-Loss) was incorporated to minimize the spatial offset between predicted proposal centers and ground-truth bounding box centers, improving localization accuracy (a minimal formulation is also sketched after the abstract). This additional term regularizes the voting process, leading to more stable cluster formation around object centroids.
A series of experiments was conducted to validate the approach. A nursery point cloud dataset of 1147 scenes, containing three types of trees and pedestrians, was constructed in the KITTI format and partitioned into 60% for training, 20% for validation, and 20% for testing to ensure fair evaluation. The model achieved an average recall (AR) of 88.06% and a mean average precision (mAP) of 55.05% at an IoU threshold of 0.25, outperforming the baseline VoteNet by 9.05% and 22.47% in AR and mAP, respectively. Detection of small objects improved dramatically: the average precision (AP) for Ilex cornuta var. fortunei rose from 4.40% to 28.77%, a 6.5-fold improvement.
An ablation study confirmed that each component contributed individually: the SFEM provided more discriminative feature learning for objects with sparse point distributions; the RAFFM aggregated contextual features while preserving geometric details, yielding higher recall for pedestrians and small tree crowns; and the CPD-Loss increased bounding box regression accuracy and stabilized training convergence. The proposed model was also compared with traditional object detection networks and performed favorably in nursery environments. The resulting 3D detection framework offers a robust perception solution for autonomous orchard robots, providing more accurate environmental perception in complex nursery settings and supporting precision agriculture applications such as targeted spraying, growth monitoring, and safe navigation around pedestrians. This work contributes to the advancement of intelligent agricultural systems.
These findings provide a reliable 3D perception technology for efficient, autonomous nursery operations. Future work will focus on optimizing the network architecture for real-time performance and extending the approach to other agricultural scenarios.
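
The RAFFM is described above only as combining self-attention with trilinear interpolation at the skip connections. For the interpolation step alone, a minimal NumPy sketch of inverse-distance-weighted three-nearest-neighbor feature upsampling is given below, the scheme commonly used for point cloud feature propagation in PointNet++-style backbones such as VoteNet's; the function name, array shapes, and the use of exactly three neighbors are illustrative assumptions, not details taken from this paper.

    import numpy as np

    def three_nn_interpolate(query_xyz, support_xyz, support_feats, eps=1e-8):
        # Upsample features from a sparse point set (support) onto a denser
        # one (query) by inverse-distance weighting over 3 nearest neighbors.
        # query_xyz: (M, 3); support_xyz: (N, 3); support_feats: (N, C).
        d2 = ((query_xyz[:, None, :] - support_xyz[None, :, :]) ** 2).sum(-1)  # (M, N)
        idx = np.argsort(d2, axis=1)[:, :3]                # 3 nearest neighbors, (M, 3)
        nn_d2 = np.take_along_axis(d2, idx, axis=1)        # their squared distances
        w = 1.0 / (nn_d2 + eps)                            # inverse-distance weights
        w /= w.sum(axis=1, keepdims=True)                  # normalize per query point
        return (support_feats[idx] * w[..., None]).sum(axis=1)  # (M, C)

    # Example: propagate 32-dim features from 64 sparse points to 256 dense points.
    dense = three_nn_interpolate(np.random.rand(256, 3),
                                 np.random.rand(64, 3),
                                 np.random.rand(64, 32))   # -> shape (256, 32)

In a RAFFM-like module, such upsampled decoder features would then be fused with the encoder features at the same resolution, with attention weighting which regions of the skip connection to emphasize.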
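Similarly, the CPD-Loss is characterized above only as a penalty on the offset between predicted proposal centers and matched ground-truth box centers. One minimal formulation consistent with that description, with the notation and the L2 form assumed rather than taken from the paper, is

    \mathcal{L}_{\mathrm{CPD}} = \frac{1}{N} \sum_{i=1}^{N} \left\lVert \hat{\mathbf{c}}_i - \mathbf{c}_i^{*} \right\rVert_2

where N is the number of proposals, \hat{\mathbf{c}}_i is the predicted center of the i-th proposal, and \mathbf{c}_i^{*} is the center of its matched ground-truth bounding box. Added to the standard VoteNet objective, such a term directly penalizes center drift and encourages the stable vote clustering described above.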