Abstract:
Harvesting high-value crops such as chili peppers remains highly labor-intensive, limiting productivity in protected cultivation. Although vision-based intelligent harvesting robots offer a promising solution, their practical deployment is constrained by the difficulty of achieving high detection precision in visually complex greenhouse environments and by the limited computational resources of embedded edge devices. To address this challenge, this study proposed LDH-YOLOv11n, a lightweight object detection model engineered to deliver high precision at low computational cost for real-time chili pepper detection in practical greenhouse settings. A self-constructed image dataset served as the foundation of this work. Images were captured across different developmental stages of chili peppers, ranging from early green fruits to fully mature red fruits. All images were manually annotated with Labelme, and the dataset was partitioned into training, validation, and test sets at a 7:1:2 ratio to prevent data leakage and ensure unbiased evaluation. To enhance generalization under diverse environmental conditions, data augmentation techniques, including random Gaussian noise, brightness adjustment, geometric transformations (flipping and rotation), and color saturation adjustment, were applied, yielding a final dataset of
8940 images. The proposed LDH-YOLOv11n architecture extended the baseline YOLOv11n model through three key innovations. First, the Simple, Parameter-Free Attention Module (SimAM) was embedded into the C3k2 module to sharpen the model's attention to salient fruit features while suppressing distraction from cluttered visual contexts such as overlapping leaves and branches. Second, several standard convolutional downsampling operations were replaced with the Average Pooling Downsampling (ADown) module, which markedly reduced computational cost while maintaining sufficient feature fidelity for accurate detection. Third, the original detection head was replaced with a custom-designed Lightweight Detection Head (LDH-Detect), reducing overall model redundancy without compromising detection precision. Extensive experiments benchmarked LDH-YOLOv11n against mainstream detection algorithms, including Faster R-CNN, SSD-300, and a series of YOLO variants (v3-tiny, v5n, v6n, v8n, v10n, v11n, and v12n). On the custom greenhouse chili dataset, LDH-YOLOv11n achieved a precision of 94.3%, a recall of 90.1%, and an mAP50-95
of 77.0%, with only 1.6 million parameters and 3.9 giga floating-point operations (GFLOPs). These results represented substantial improvements over the YOLOv11n baseline, with gains of 1.0, 2.2, and 2.1 percentage points in precision, recall, and mAP50-95, respectively, while simultaneously reducing the parameter count by 38.5% and the GFLOPs by 38.1%. Qualitative evaluations further demonstrated the model's robustness across four representative and challenging scenarios: standard illumination, fruit overlap, branch and leaf occlusion, and low-light conditions; LDH-YOLOv11n was the only model to achieve zero false positives and zero missed detections in all four. Furthermore, deployment tests on an embedded edge device with TensorRT acceleration yielded an inference speed of 264.6 frames per second (FPS), a 3.58-fold improvement over the non-accelerated version and far above the 30 FPS threshold for real-time performance. In conclusion, the proposed LDH-YOLOv11n is a practically deployable lightweight detection model that reconciles the competing requirements of high detection precision and low computational complexity. Its robustness, precision, and efficiency make it a strong candidate for accelerating the deployment of intelligent harvesting systems and advancing precision agriculture.
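For readers unfamiliar with SimAM, the sketch below illustrates the parameter-free, energy-based attention that the modified C3k2 blocks rely on. It follows the published SimAM formulation; the regularization value and the standalone module structure are illustrative assumptions rather than the exact integration used in LDH-YOLOv11n.

```python
import torch
import torch.nn as nn

class SimAM(nn.Module):
    """Parameter-free SimAM attention: scores each activation with an
    energy-based saliency measure and rescales the feature map."""

    def __init__(self, e_lambda: float = 1e-4):  # regularizer; value assumed
        super().__init__()
        self.e_lambda = e_lambda
        self.act = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        _, _, h, w = x.shape
        n = h * w - 1
        # Squared deviation of each activation from its channel mean
        d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)
        # Channel-wise variance estimate over the spatial dimensions
        v = d.sum(dim=(2, 3), keepdim=True) / n
        # Inverse energy: activations that stand out from their channel receive larger weights
        e_inv = d / (4 * (v + self.e_lambda)) + 0.5
        return x * self.act(e_inv)
```

A plausible (assumed) integration point is applying this module to the output feature map of each modified C3k2 block.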
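The ADown module originates from YOLOv9; the following sketch reproduces its commonly used form, in which the channels are split and downsampled by an average-pool/strided-convolution branch and a max-pool/pointwise-convolution branch. The conv_bn_silu helper is a simplified stand-in (an assumption) for the standard YOLO Conv block.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_silu(c_in: int, c_out: int, k: int, s: int, p: int) -> nn.Sequential:
    # Simplified stand-in for the YOLO "Conv" block (Conv2d + BatchNorm + SiLU).
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, s, p, bias=False),
        nn.BatchNorm2d(c_out),
        nn.SiLU(inplace=True),
    )

class ADown(nn.Module):
    """Pooling-based downsampling that halves spatial resolution with two
    lightweight branches instead of a single strided convolution."""

    def __init__(self, c1: int, c2: int):
        super().__init__()
        self.c = c2 // 2
        self.cv1 = conv_bn_silu(c1 // 2, self.c, 3, 2, 1)
        self.cv2 = conv_bn_silu(c1 // 2, self.c, 1, 1, 0)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = F.avg_pool2d(x, kernel_size=2, stride=1, padding=0)
        x1, x2 = x.chunk(2, dim=1)                      # split channels into two branches
        x1 = self.cv1(x1)                               # branch 1: 3x3 stride-2 convolution
        x2 = F.max_pool2d(x2, 3, stride=2, padding=1)   # branch 2: max pooling
        x2 = self.cv2(x2)                               # followed by a 1x1 convolution
        return torch.cat((x1, x2), dim=1)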
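The abstract reports TensorRT-accelerated inference on an embedded device but does not spell out the export pipeline. For YOLOv11-family models, one common route is the Ultralytics export API, sketched below; the weight file and test image names are hypothetical placeholders, and a custom architecture may require its modules to be registered before loading.

```python
from ultralytics import YOLO

# Export the trained detector to a TensorRT engine (FP16) on the target device.
model = YOLO("ldh_yolov11n.pt")            # hypothetical path to trained weights
model.export(format="engine", half=True)   # writes ldh_yolov11n.engine via TensorRT

# Run accelerated inference with the generated engine.
trt_model = YOLO("ldh_yolov11n.engine")
results = trt_model.predict("greenhouse_frame.jpg", imgsz=640, conf=0.25)
```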