Abstract:
Tomato is one of the most widely grown vegetables in the world, and harvesting it is a typical labor-intensive task. Mechanical harvesting is expected to save labor and promote agricultural development. However, harvesting robots usually operate in complex greenhouse environments, where selective harvesting requires accurate three-dimensional (3D) pose information of the target fruits; this information directly affects the success rate of grasping and picking. In this study, a real-time method was proposed to estimate the 3D pose of tomato fruits under such challenging conditions using instance segmentation and spatial analysis. A cascaded network architecture combined lightweight instance segmentation with improved keypoint detection, and a spatial parsing module operating on the point cloud was integrated to achieve high-precision, real-time pose estimation. Firstly, a lightweight instance segmentation model, YOLOv7-M1, was developed from the original YOLOv7-seg framework. The CBS backbone of YOLOv7-seg was reconstructed using the MobileOne module, which substantially reduced the computational cost while maintaining high mask segmentation accuracy. The instance segmentation generated pixel-level fruit masks and regions of interest (ROIs), thereby providing both the keypoint ROIs and prior knowledge of the fruit geometry and surrounding environment. Secondly, the keypoints within the ROIs were detected using an enhanced HRNet model, denoted HRNet-ECA. An Efficient Channel Attention (ECA) mechanism was embedded after the parallel output branches at all four stages of HRNet. The improved channel-wise feature selection made keypoint localization more robust to clustered fruits, occlusions, and non-uniform illumination.
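The channel attention step described above can be illustrated with a small pure-Python sketch. It is not the authors' HRNet-ECA implementation: the function names `eca_kernel_size` and `eca` are hypothetical, the adaptive kernel-size rule (gamma = 2, b = 1) is taken from the original ECA-Net formulation rather than from this abstract, and the 1D kernel weights, which a real network would learn, are simply passed in by the caller.

```python
import math


def eca_kernel_size(channels, gamma=2, b=1):
    """Adaptive 1D kernel size: (log2(C) + b) / gamma, truncated and forced odd.

    Assumption: rule and defaults follow the ECA-Net formulation, not this abstract.
    """
    t = int(abs((math.log2(channels) + b) / gamma))
    return t if t % 2 else t + 1


def eca(feature_map, weights):
    """Apply channel attention to a C x H x W feature map given as nested lists.

    weights: odd-length 1D kernel shared by all channels (a placeholder for
    learned parameters). Returns (rescaled feature map, per-channel gates).
    """
    pad = len(weights) // 2
    # 1) Global average pooling: one scalar descriptor per channel.
    pooled = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature_map]
    # 2) 1D convolution across neighbouring channels, zero-padded at both ends.
    padded = [0.0] * pad + pooled + [0.0] * pad
    mixed = [
        sum(w * padded[c + j] for j, w in enumerate(weights))
        for c in range(len(pooled))
    ]
    # 3) Sigmoid gate per channel, then rescale every value in that channel.
    gates = [1.0 / (1.0 + math.exp(-m)) for m in mixed]
    scaled = [
        [[g * v for v in row] for row in ch]
        for ch, g in zip(feature_map, gates)
    ]
    return scaled, gates
```

Under these assumptions, `eca_kernel_size(64)` gives a kernel of size 3 and `eca_kernel_size(256)` a kernel of size 5, so the attention step adds only a handful of parameters per branch. That is consistent with the small computational overhead that lightweight channel attention is chosen for.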
A multi-modal data fusion framework was constructed to combine the depth map with the ROIs. The point clouds of the target fruits were generated and then processed via color filtering, outlier removal, down-sampling, and least-squares sphere fitting. Real-time geometric computation on the filtered point clouds then yielded the 3D pose of each tomato, including its spatial position and orientation. Experimental results showed that the lightweight instance segmentation improved real-time performance without compromising accuracy. Compared with the classical instance segmentation models Mask-RCNN and Solov2, the improved YOLOv7-M1 model increased average precision by 14.35 and 15.21 percentage points, precision by 14.06 and 14.47 percentage points, recall by 13.25 and 11.10 percentage points, and mean Average Precision (mAP50) by 14.30 and 14.51 percentage points, respectively. Furthermore, compared with YOLOv7-seg, YOLOv8l-seg, and YOLOv11l-seg, the GFLOPs of YOLOv7-M1 were reduced by 30.23%, 54.75%, and 29.99%, while the frame rate increased by 10.45%, 14.16%, and 8.34%, respectively. In the keypoint detection stage, three attention mechanisms (SE, CBAM, and ECA) were embedded into HRNet for comparison. They improved the keypoint similarity by 0.87, 1.68, and 2.23 percentage points, respectively. ECA provided the largest accuracy gain while increasing the GFLOPs by less than 1%, thus meeting real-time constraints. The overall pose estimation framework was further validated on 100 groups of tomato data. The average accuracy of the pose estimation reached 95.00%, the mean 3D orientation error was 9.40°, and the mean keypoint localization error was 4.13 mm. The average position errors in the X, Y, and Z directions were 3.41, 2.95, and 1.02 mm, respectively. The average processing time per fruit was 0.063 s. Finally, the improved model was deployed on a tomato harvesting robot and tested in a real greenhouse. Of 50 harvesting trials, 44 were successful, corresponding to a success rate of 88.00%. These findings can provide reliable fruit pose information to guide the end effector. The favorable balance between accuracy and real-time performance offers an efficient solution for precise mechanical harvesting in complex agricultural environments.
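The least-squares sphere fitting named among the point cloud processing steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses the common algebraic (linear) formulation, centers the points on their centroid for numerical stability, and solves the 4 x 4 normal equations with a small Gaussian elimination. The function names `solve_linear` and `fit_sphere` are hypothetical.

```python
import math


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system (e.g. coplanar points)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_sphere(points):
    """Algebraic least-squares sphere fit; returns ((cx, cy, cz), radius).

    Uses |p|^2 = 2 p.c + d with d = r^2 - |c|^2, which is linear in (c, d).
    """
    n = len(points)
    if n < 4:
        raise ValueError("need at least 4 points")
    # Center the data so the linear system is well conditioned.
    mean = [sum(p[k] for p in points) / n for k in range(3)]
    q = [[p[k] - mean[k] for k in range(3)] for p in points]
    rows = [[2.0 * x, 2.0 * y, 2.0 * z, 1.0] for x, y, z in q]
    rhs = [x * x + y * y + z * z for x, y, z in q]
    # Normal equations: (A^T A) params = A^T rhs.
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(4)] for i in range(4)]
    atb = [sum(r[i] * t for r, t in zip(rows, rhs)) for i in range(4)]
    cx, cy, cz, d = solve_linear(ata, atb)
    radius = math.sqrt(d + cx * cx + cy * cy + cz * cz)
    return (cx + mean[0], cy + mean[1], cz + mean[2]), radius
```

A depth camera sees only the camera-facing part of a fruit, so the fit typically has to recover the center from a partial cap of surface points. The algebraic form is attractive for real-time use because it reduces to a single small linear solve per fruit.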