Abstract:
Accurate recognition of grape picking points is essential for enabling intelligent, efficient, and non-destructive harvesting in automated grape-picking robots. However, in unstructured orchard environments, factors such as occlusions, irregular lighting, and the complex spatial distribution of grape clusters significantly hinder the robustness of 3D localization and reduce the overall reliability of harvesting decisions. To address these challenges, this study proposes a dual-modal visual perception and cognition framework that integrates both 3D point clouds and 2D RGB images for robust and precise picking point localization across diverse orchard conditions. The framework begins with comprehensive 3D semantic scene understanding based on Point Transformer V2 (PTV2), a point-cloud processing model that incorporates grouped vector attention and relative positional encoding to capture both local geometric structures and long-range contextual dependencies. Point clouds acquired from a depth camera are semantically segmented into classes such as grapes, stems, and branches, forming the structural foundation for subsequent geometric analysis. The PTV2 model achieves high segmentation accuracy, reaching a mean Intersection over Union (mIoU) of 89.83%, with IoU values of 78.55% and 84.20% for stem-related categories, demonstrating strong recognition capability in real orchard scenarios. Building on the semantic segmentation output, a 3D Grape Picking Point Localization Algorithm (3D GPPLA) is proposed to determine picking points within complex grape cluster arrangements. The algorithm introduces a two-stage clustering strategy based on DBSCAN and K-Means to separate semantically segmented multi-cluster grape regions into independent candidate point clouds, followed by a morphology-based validation procedure to determine whether each cluster corresponds to an individual grape bunch.
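The two-stage clustering strategy can be sketched as follows: DBSCAN first separates spatially disjoint groups of grape-class points and discards noise, after which groups judged too large to be a single bunch are split with K-Means. The `eps`, `min_samples`, and `k` parameters and the width-based splitting heuristic are illustrative assumptions, not values reported in the paper; the actual method uses a morphology-based validation procedure.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

def two_stage_cluster(points, eps=0.03, min_samples=20, k=2):
    """Two-stage split of grape-class points into candidate bunches.
    Stage 1: DBSCAN separates spatially disjoint groups, dropping noise.
    Stage 2: groups that look too wide to be one bunch are split by K-Means.
    All thresholds are illustrative, not taken from the paper.
    """
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
    candidates = []
    for lbl in set(labels) - {-1}:  # -1 marks DBSCAN noise
        group = points[labels == lbl]
        # Stand-in for the paper's morphology-based validation: treat a
        # group wider than an assumed typical bunch (15 cm) as multi-cluster.
        if np.ptp(group[:, 0]) > 0.15:
            sub = KMeans(n_clusters=k, n_init=10).fit_predict(group)
            candidates += [group[sub == i] for i in range(k)]
        else:
            candidates.append(group)
    return candidates
```

In practice each returned candidate cloud would then be passed to the validation step and, if accepted, on to picking-point estimation.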
To ensure computational efficiency and prevent over-segmentation, a limit on recursion depth is imposed. If no valid partition is obtained within the allowed depth, the system rolls back to a previous clustering state to maintain stability. Once a single grape bunch is identified, 3D GPPLA estimates the picking point by analyzing the spatial relationship between the grape centroid and the peduncle region. Specifically, the algorithm computes a minimum bounding box around the grape-peduncle subset and determines the optimal picking direction by evaluating peduncle proximity and accessibility, ensuring minimal damage during separation and consistent harvesting performance. To further enhance robustness in cases where the 3D approach fails due to severe occlusion, missing depth data, or segmentation noise, a complementary 2D fallback strategy is introduced. When a failure is detected, the system extracts the corresponding 2D RGB image and switches to image-based inference. Using SegFormer, a transformer-based semantic segmentation network, the 2D image is segmented into grape and peduncle classes with high fidelity. The 2D GPPLA then computes picking points in the image space using shape heuristics and spatial priors, and subsequently projects them into the 3D point cloud through depth-aligned pixel mapping. This fallback mechanism enhances resilience in cluttered and partially observable environments while leveraging the richer texture and color cues available in RGB images to compensate for limitations in point-cloud resolution and sensor noise. The proposed method is evaluated on a custom dataset comprising 1,847 grape clusters collected under natural orchard conditions. The 3D GPPLA achieves a picking-point localization success rate of 89.11%. In particular, the algorithm attains success rates of 98.81% in single-cluster scenarios and 80.95% in multi-cluster arrangements, highlighting its adaptability to varying levels of structural complexity.
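The depth-aligned pixel mapping used by the 2D fallback can be illustrated with a standard pinhole-camera back-projection: a 2D picking point and its aligned depth value are lifted into 3D camera coordinates. This is a minimal sketch, and the intrinsic parameters (`fx`, `fy`, `cx`, `cy`) are hypothetical calibration values, not ones reported in the paper.

```python
import numpy as np

def pixel_to_point(u, v, depth_m, fx, fy, cx, cy):
    """Back-project a 2D picking point (u, v) with metric depth depth_m
    into 3D camera coordinates using the pinhole model.
    fx, fy, cx, cy are the depth camera's intrinsics (illustrative values
    would come from the sensor's calibration)."""
    z = depth_m                 # depth at pixel (u, v), aligned to the RGB image
    x = (u - cx) * z / fx       # horizontal offset scaled by depth
    y = (v - cy) * z / fy       # vertical offset scaled by depth
    return np.array([x, y, z])
```

A picking point computed at the principal point, for example, maps to a 3D point lying on the optical axis at the measured depth.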
When combined with the 2D fallback strategy, the system demonstrates high overall reliability, significantly reducing failure cases in cluttered and occluded scenarios. By integrating advanced 3D semantic segmentation, adaptive multi-stage clustering, and cross-modal compensation, the proposed framework achieves accurate, stable, and efficient picking-point localization in unstructured vineyard environments. This work lays a solid technical foundation for the practical deployment of grape-harvesting robots and contributes meaningfully to the broader advancement of smart agriculture and robotic fruit-picking systems.