Abstract:
Pile picking is commonly used for green Sichuan pepper in southern China. Pruned prickly ash branches, together with their fruits and leaves, form complex stacking scenarios that limit the level of automation, and existing harvesters cannot fully meet the demands of large-scale production during pile picking. Accurate recognition and localization of the prickly ash branches are therefore required for efficient grasping. In this study, a grasping sequence reasoning method was proposed using an improved YOLOv8-Seg network. The network structure was optimized to enhance the perception and integration of multi-scale features. Specifically, a convolutional block attention module (CBAM) was embedded before feature concatenation in the C2f modules corresponding to the P4 and P5 layers of the backbone network, so that the attention weights of the feature maps were adjusted adaptively and target features at different spatial positions were strengthened. Meanwhile, the original spatial pyramid pooling-fast (SPPF) module was replaced with an atrous spatial pyramid pooling (ASPP) module, reinforcing the representation of both local and global contextual features and improving the precision and robustness of segmenting occluded targets. A grasping score function was further developed for complex stacks of prickly ash branches, considering three key factors: the branch-to-camera distance, mask completeness, and the entanglement risk between neighboring branches. A Bayesian optimization approach was applied to determine the optimal weight coefficients of these factors, which were 0.797, 0.183, and 0.020, respectively. These coefficients were combined with depth information to compute the grasping score, from which the optimal grasping sequence was inferred to efficiently prioritize among stacked branches. Experimental results showed that the improved model significantly enhanced branch recognition and grasping sequence reasoning under various stacking conditions. The mean intersection over union (mIoU) and mean pixel accuracy (mPA) reached 86.68% and 91.04%, respectively. The precision, recall, and F1-score were 95.70%, 91.04%, and 92.82%, respectively, increases of 9.74%, 9.44%, and 4.99% over the original model. Furthermore, the improved model outperformed mainstream instance segmentation models, such as Mask R-CNN, YOLACT, and YOLOv5, in segmentation accuracy, boundary recognition, and robustness against occlusion. Grasping experiments under practical harvesting operations were conducted to verify the effectiveness of the improved model, using an AUBO-i10 robotic arm equipped with a two-finger gripper and an Intel RealSense D435i depth camera in an eye-in-hand configuration. The robotic system successfully performed the detection, recognition, reasoning, and grasping of the prickly ash branches, achieving a grasping success rate of 75.86% and a sequence reasoning accuracy of 86.21%, which demonstrated the feasibility and stability of the approach in complex stacking scenarios. The reasoning strategy can be effectively applied to grasping sequence inference in intelligent prickly ash harvesters, and the findings provide a reference for optimizing the automated harvesting of green Sichuan pepper.
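As a rough illustration of how the grasping score described above might be combined into a sequence, the following Python sketch weights the three factors with the reported coefficients (0.797, 0.183, 0.020) and sorts detections by score. The abstract does not give the exact normalization or factor definitions, so the distance working range, the direction of each term, and the function and variable names here are assumptions for illustration only, not the paper's implementation.

```python
import numpy as np

# Weight coefficients reported in the abstract (obtained via Bayesian optimization).
W_DISTANCE, W_COMPLETENESS, W_ENTANGLEMENT = 0.797, 0.183, 0.020

def grasping_score(distance_m, mask_completeness, entanglement_risk,
                   d_min=0.3, d_max=1.2):
    """Illustrative weighted score for one detected branch.

    distance_m:        branch-to-camera distance from the depth map (meters)
    mask_completeness: fraction of the branch mask that is unoccluded, in [0, 1]
    entanglement_risk: estimated overlap with neighboring branches, in [0, 1]
    d_min, d_max:      assumed working range of the depth camera (not from the paper)
    """
    # Normalize distance so that nearer branches score higher (assumed convention).
    distance_term = 1.0 - np.clip((distance_m - d_min) / (d_max - d_min), 0.0, 1.0)
    return (W_DISTANCE * distance_term
            + W_COMPLETENESS * mask_completeness
            + W_ENTANGLEMENT * (1.0 - entanglement_risk))

def grasping_sequence(branches):
    """Sort detected branches by descending score to obtain a grasping order."""
    return sorted(branches, key=lambda b: grasping_score(*b), reverse=True)

# Hypothetical detections: (distance_m, mask_completeness, entanglement_risk).
detections = [(0.55, 0.92, 0.10), (0.48, 0.60, 0.45), (0.70, 0.98, 0.05)]
print(grasping_sequence(detections))
```

In this sketch, the branch grasped first is the one that is close to the camera, largely unoccluded, and weakly entangled with its neighbors, which matches the prioritization behavior described in the abstract.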