An approach for grape harvest scene perception and picking point localization using 3D point clouds

    • Abstract: To achieve high-precision localization of 3D picking points for grape clusters in unstructured orchard environments, this study proposes a visual perception and cognition method that fuses 3D point clouds with 2D images. First, the Point Transformer V2 model is used for fine-grained semantic segmentation of the harvest scene, providing semantic support for subsequent clustering and picking-point localization. Second, drawing on the morphological characteristics of grape clusters, a 3D grape picking point localization algorithm (3D GPPLA) is proposed; it applies a two-stage clustering strategy combining DBSCAN and K-Means to effectively separate multiple grape clusters and localize the 3D picking point of each individual cluster. To handle cases where 3D localization fails, an RGB-image-based compensation mechanism is further introduced: the SegFormer model performs 2D semantic perception, and a 2D grape picking point localization algorithm (2D GPPLA) completes coordinate projection and 3D accuracy compensation. Experimental results show that Point Transformer V2 achieves an mIoU of 89.83% in the semantic segmentation task, with IoU values of 78.55% and 84.20% for the stem and branch categories, respectively. In tests on 1,847 grape-cluster samples, the 3D GPPLA algorithm achieves picking-point localization success rates of 98.81% in single-cluster scenes and 80.95% in multi-cluster scenes, for an overall rate of 89.11%. These results verify the high accuracy and robustness of the proposed method for 3D picking-point localization, providing technical support for optimizing the vision systems of grape-harvesting robots and for low-damage harvesting in unstructured environments.

       

      Abstract: Accurate recognition of grape picking points is essential for enabling intelligent, efficient, and non-destructive harvesting in automated grape-picking robots. However, under unstructured orchard environments, factors such as occlusions, irregular lighting, and the complex spatial distribution of grape clusters significantly hinder the robustness of 3D localization and reduce the overall reliability of harvesting decisions. To address these challenges, this study proposes a dual-modal visual perception and cognition framework that integrates both 3D point clouds and 2D RGB images for robust and precise picking point localization across diverse orchard conditions. The framework begins with comprehensive 3D semantic scene understanding based on Point Transformer V2 (PTV2), a point-cloud processing model that incorporates grouped vector attention and relative positional encoding to capture both local geometric structures and long-range contextual dependencies. Point clouds acquired from a depth camera are semantically segmented into classes such as grapes, stems, and branches, forming the structural foundation for subsequent geometric analysis. The PTV2 model achieves high segmentation accuracy, reaching a mean Intersection over Union (mIoU) of 89.83%, with IoU values of 78.55% and 84.20% for the stem and branch categories, respectively, demonstrating strong recognition capability in real orchard scenarios. Building on the semantic segmentation output, a 3D Grape Picking Point Localization Algorithm (3D GPPLA) is proposed to determine picking points within complex grape cluster arrangements. The algorithm introduces a two-stage clustering strategy based on DBSCAN and K-Means to separate semantically segmented multi-cluster grapes into independent candidate cluster point clouds, followed by a morphology-based validation procedure to determine whether each cluster corresponds to an individual grape bunch. 
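The two-stage DBSCAN-then-K-Means separation described above can be sketched as follows. This is a minimal illustration only: the function name, parameter values, and the size-based heuristic for deciding when a DBSCAN region holds several touching bunches are assumptions for demonstration, not the paper's actual morphology-based validation procedure.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

def split_grape_clusters(points, eps=0.03, min_samples=30, max_bunch_extent=0.25):
    """Stage 1: DBSCAN separates spatially disjoint grape regions.
    Stage 2: K-Means further splits regions whose spatial extent
    suggests several touching bunches (illustrative heuristic).

    points: (N, 3) array of grape-class points in meters.
    Returns a list of per-bunch point arrays.
    """
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
    bunches = []
    for lbl in set(labels) - {-1}:               # -1 marks DBSCAN noise
        region = points[labels == lbl]
        extent = region.max(axis=0) - region.min(axis=0)
        # Heuristic: a region wider than one plausible bunch gets split.
        k = max(1, int(np.ceil(extent.max() / max_bunch_extent)))
        if k == 1:
            bunches.append(region)
        else:
            sub = KMeans(n_clusters=k, n_init=10).fit_predict(region)
            bunches.extend(region[sub == i] for i in range(k))
    return bunches
```

In practice each candidate bunch would then pass through the morphology-based validation step before being accepted as an individual grape cluster.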
To ensure computational efficiency and prevent over-segmentation, a restriction on recursive depth is imposed. If no valid partition is obtained within the allowed depth, the system rolls back to a previous clustering state to maintain stability. Once a single grape bunch is identified, 3D GPPLA estimates the picking point by analyzing the spatial relationship between the grape centroid and the peduncle region. Specifically, the algorithm computes a minimum bounding box around the grape-peduncle subset and determines the optimal picking direction by evaluating peduncle proximity and accessibility, ensuring minimal damage during separation and consistent harvesting performance. To further enhance robustness in cases where the 3D approach fails due to severe occlusion, missing depth data, or segmentation noise, a complementary 2D fallback strategy is introduced. When a failure is detected, the system extracts the corresponding 2D RGB image and switches to image-based inference. Using SegFormer—an advanced transformer-based semantic segmentation network—the 2D image is segmented into high-fidelity grape and peduncle classes. The 2D GPPLA algorithm then computes picking points in the image space using shape heuristics and spatial priors, and subsequently projects them into the 3D point cloud through depth-aligned pixel mapping. This fallback mechanism enhances resilience in cluttered and partially observable environments while leveraging the richer texture and color cues available in RGB images to compensate for limitations in point-cloud resolution and sensor noise. The proposed method is evaluated on a custom dataset comprising 1,847 grape clusters collected under natural orchard conditions. The 3D GPPLA achieves a picking-point localization success rate of 89.11%. In particular, the algorithm attains success rates of 98.81% in single-cluster scenarios and 80.95% in multi-cluster arrangements, highlighting its adaptability to varying levels of structural complexity. 
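The depth-aligned pixel mapping used by the 2D fallback can be illustrated with a standard pinhole back-projection, which lifts an image-space picking point into camera-frame 3D coordinates. This sketch assumes a depth image already aligned to the RGB frame; the function name and intrinsic values are illustrative, not taken from the paper.

```python
import numpy as np

def pixel_to_point(u, v, depth_m, fx, fy, cx, cy):
    """Back-project a 2D picking point (u, v) with its aligned depth
    (meters) into camera-frame 3D coordinates via the pinhole model:
        x = (u - cx) * z / fx,  y = (v - cy) * z / fy,  z = depth.
    fx, fy are focal lengths in pixels; (cx, cy) is the principal point.
    """
    z = depth_m
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])
```

In a real pipeline the depth value would typically be taken as a robust statistic (e.g. the median over a small window around the pixel) to suppress sensor noise and holes in the depth map.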
When combined with the 2D fallback strategy, the system demonstrates high overall reliability, significantly reducing failure cases in cluttered and occluded scenarios. By integrating advanced 3D semantic segmentation, adaptive multi-stage clustering, and cross-modal compensation, the proposed framework achieves accurate, stable, and efficient picking-point localization in unstructured vineyard environments. This work lays a solid technical foundation for the practical deployment of grape-harvesting robots and contributes meaningfully to the broader advancement of smart agriculture and robotic fruit-picking systems.

       
