Real-time estimation of the 3D pose of tomato fruits using instance segmentation and spatial analysis

    • Abstract: To address the low accuracy and poor real-time performance of existing fruit pose estimation methods in complex cultivation environments, this study proposes a real-time 3D pose estimation method for tomatoes that fuses lightweight instance segmentation with spatial analysis. An improved lightweight YOLOv7-M1 network was built to extract high-precision fruit masks and rapidly locate keypoint regions of interest; an HRNet-ECA model embedding an efficient channel attention mechanism was designed to improve detection accuracy; and a multi-modal data fusion framework was constructed that combines the depth map with the regions of interest, obtaining the 3D pose parameters of the fruit in real time through point-cloud filtering and spatial geometric computation. Experimental results show that the improved YOLOv7-M1 achieved a mask segmentation average precision of 95.56%, a recall of 93.52%, and a precision of 96.17%; the improved HRNet-ECA achieved a keypoint similarity of 96.61%, a pose estimation accuracy of 95.0%, a mean 3D orientation error of 9.40°, and a mean keypoint localization error of 4.13 mm, with mean keypoint errors of 3.41, 2.95, and 1.02 mm in the X, Y, and Z directions, respectively. The average processing time per fruit was 0.063 s. By cascading a lightweight instance segmentation network with an improved keypoint detection model and combining point-cloud spatial analysis, the method balances accuracy with real-time efficiency, achieving high-precision real-time pose estimation of tomato fruits and providing an efficient solution for precise automated fruit and vegetable harvesting in complex agricultural scenes.

       

      Abstract: Tomato is one of the most widely grown vegetables in the world, and its harvesting is a typical labor-intensive task. Mechanized harvesting is expected to reduce labor demand and promote agricultural development. However, harvesting robots often operate in complex greenhouse environments, where accurate three-dimensional (3D) pose information of the target fruits is required for selective harvesting and directly affects the success rate of grasping and picking. In this study, a real-time method was proposed to estimate the 3D pose of tomato fruits using instance segmentation and spatial analysis, so that the pose of the target fruits could be captured under such challenging conditions. A cascaded network architecture was constructed to combine lightweight instance segmentation with improved keypoint detection, and a spatial parsing module was integrated with the point cloud to achieve high-precision, real-time pose estimation. Firstly, a lightweight instance segmentation model, YOLOv7-M1, was developed by improving the original YOLOv7-seg framework: the CBS backbone of YOLOv7-seg was reconstructed using the MobileOne module, which substantially reduced the computational cost while maintaining high mask segmentation accuracy. The instance segmentation generated pixel-level fruit masks and regions of interest (ROIs), providing both the keypoint ROIs and prior knowledge of the fruit geometry and surrounding environment. Secondly, the keypoints within the ROIs were detected using an enhanced HRNet model, denoted HRNet-ECA, in which an Efficient Channel Attention (ECA) mechanism was embedded after the parallel output branches at all four stages of the HRNet. The improved channel-wise feature selection made keypoint localization more robust to fruit clustering, occlusion, and non-uniform illumination.
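The ECA mechanism referenced above squeezes each feature map by global average pooling, applies a 1D convolution across channels with an adaptively chosen kernel size, and rescales the channels with a sigmoid gate. The following is a minimal NumPy sketch of that computation, not the paper's implementation; the fixed averaging weights `w` stand in for the learned 1D-convolution weights:

```python
import numpy as np

def eca(x, gamma=2, b=1):
    """Efficient Channel Attention over a feature map x of shape (C, H, W).

    Kernel size k is chosen adaptively from the channel count C,
    following the ECA paper's heuristic k = |log2(C)/gamma + b/gamma|_odd.
    """
    C = x.shape[0]
    t = int(abs((np.log2(C) + b) / gamma))
    k = t if t % 2 else t + 1          # force an odd kernel size
    y = x.mean(axis=(1, 2))            # squeeze: global average pooling -> (C,)
    yp = np.pad(y, k // 2)             # zero-pad for same-size 1D convolution
    w = np.ones(k) / k                 # illustrative weights; learned in practice
    z = np.array([yp[i:i + k] @ w for i in range(C)])
    s = 1.0 / (1.0 + np.exp(-z))       # sigmoid channel gate
    return x * s[:, None, None]        # excite: channel-wise rescaling
```

Because the gate only reweights channels, the output keeps the input's shape, which is why the module can be dropped after any of HRNet's parallel branches without changing the surrounding architecture.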
A multi-modal data fusion framework was constructed to combine the depth map with the ROIs. The point clouds of the target fruits were generated and then processed via color filtering, outlier removal, down-sampling, and least-squares sphere fitting. Real-time geometric computation on the filtered point clouds then yielded the 3D pose of each tomato, including its spatial position and orientation. Experimental results showed that the lightweight instance segmentation improved real-time performance without compromising accuracy. Compared with classical instance segmentation models such as Mask R-CNN and SOLOv2, the improved YOLOv7-M1 achieved better performance: average precision increased by 14.35 and 15.21 percentage points, precision by 14.06 and 14.47 percentage points, recall by 13.25 and 11.10 percentage points, and mean Average Precision (mAP50) by 14.30 and 14.51 percentage points, respectively. Furthermore, the GFLOPs of YOLOv7-M1 were reduced by 30.23%, 54.75%, and 29.99%, while the frame rate increased by 10.45%, 14.16%, and 8.34%, compared with YOLOv7-seg, YOLOv8l-seg, and YOLOv11l-seg, respectively. In the keypoint detection stage, three attention mechanisms—SE, CBAM, and ECA—were embedded into the HRNet for comparison, improving keypoint similarity by 0.87, 1.68, and 2.23 percentage points, respectively. ECA provided the highest accuracy while increasing the GFLOPs by less than 1%, thus meeting the real-time constraints. The overall pose estimation framework was further validated on 100 groups of tomato data. The average accuracy of the pose estimation reached 95.00%, the mean 3D orientation error was 9.40°, and the mean keypoint localization error was 4.13 mm. The average position errors in the X, Y, and Z directions were 3.41, 2.95, and 1.02 mm, respectively. The average processing time per fruit was 0.063 s.
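The point-cloud post-processing described above ends with statistical outlier removal and a least-squares sphere fit, which gives the fruit's center (position) and radius. The sphere fit is linear because ||p - c||² = r² can be rewritten as 2c·p + (r² - c·c) = p·p. The sketch below is an illustrative NumPy version under that formulation, not the authors' code; the neighbor count `k` and threshold `std_ratio` are assumed parameters:

```python
import numpy as np

def remove_outliers(points, k=8, std_ratio=2.0):
    """Statistical outlier removal on an (N, 3) point cloud: drop points
    whose mean distance to their k nearest neighbours exceeds the
    global mean by more than std_ratio standard deviations."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    d.sort(axis=1)
    mean_knn = d[:, 1:k + 1].mean(axis=1)      # skip self-distance (0)
    keep = mean_knn <= mean_knn.mean() + std_ratio * mean_knn.std()
    return points[keep]

def fit_sphere(points):
    """Linear least-squares sphere fit: solve A @ [cx, cy, cz, d] = p·p
    with A = [2x, 2y, 2z, 1] and d = r^2 - c·c. Returns (center, radius)."""
    A = np.hstack([2.0 * points, np.ones((len(points), 1))])
    b = (points ** 2).sum(axis=1)
    sol, *_ = np.linalg.lstsq(A, b, rcond=None)
    center = sol[:3]
    radius = float(np.sqrt(sol[3] + center @ center))
    return center, radius
```

A linear fit like this avoids iterative optimization entirely, which is one plausible reason a sphere model keeps the per-fruit processing time in the tens of milliseconds.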
Finally, the improved model was deployed on a tomato harvesting robot and tested in a real greenhouse. Among 50 harvesting trials, 44 were successful, corresponding to a success rate of 88.00%. The method provides reliable fruit pose information to guide the end effector, and its favorable balance between accuracy and real-time performance offers an efficient solution for precise mechanized harvesting in complex agricultural environments.
