Abstract:
Accurate and rapid depth estimation of apple trees is widely applicable to precise fruit positioning and autonomous robotic harvesting. In this study, an improved High-Resolution Network (HRNet) was proposed to estimate the monocular depth of apple trees in real scenes, meeting the practical need to obtain depth from a single RGB image for mechanized apple picking. Firstly, a multi-branch parallel encoder network was constructed on HRNet to extract multi-scale features, and a dense connection mechanism was introduced to enhance continuity in feature transfer. Secondly, the Convolutional Block Attention Module (CBAM) was used to recalibrate the fused feature maps at the channel and pixel levels, in order to reduce the noise interference caused by redundant features. The different weight distributions of the feature maps were thereby learned effectively, enhancing the structural information. In the decoder network, the Stripe Refinement Module (SRM) was used to aggregate boundary pixels along the orthogonal horizontal and vertical directions. The boundary details of the feature map were adaptively optimized to highlight edge features, which reduced edge blurring in the predicted images. Finally, up-sampling was used to generate predicted depth images of the same size as the RGB images. An image acquisition platform was constructed to collect RGB and depth images of apple orchards at different times. The data were then augmented using horizontal mirroring, color jitter, and random rotation. After augmentation, 3374 orchard RGB and depth images were obtained for the depth dataset. A series of experiments was conducted on the NYU Depth V2 dataset and the orchard depth dataset. Ablation experiments were first performed on HRNet variants with different degrees of improvement. 
Each improved variant predicted significantly better than the original HRNet, indicating that adding the dense connection mechanism, CBAM, and SRM improved model performance. Secondly, in a comparison with current mainstream networks on the orchard depth dataset, the improved HRNet achieved a mean relative error (MRE), root mean square error (RMS), logarithmic mean error, depth edge accuracy error, and edge integrity error of 0.123, 0.547, 0.051, 3.90 and 10.59, respectively. The accuracy reached 0.850, 0.975 and 0.993 at the three thresholds, respectively. In subjective visual terms, the depth maps generated by the improved HRNet had more accurate spatial resolution. The improved network presented the distribution of depth information in the image better, with clearer edges and more texture details. More importantly, the depth of some small objects was also recovered, giving the best overall result and the closest match to the real depth map. The ablation analysis, together with the subjective and objective comparisons, demonstrated the effectiveness of the improved network for depth estimation. The experiments also verified that the proposed network outperformed the compared networks in both visual quality and objective measurement on the NYU Depth V2 dataset and the orchard depth dataset. The findings offer a new approach to obtaining depth information for automatic apple-picking machines.
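For readers who want to reproduce the objective measures, the sketch below shows how MRE, RMS, the logarithmic mean error, and the threshold accuracy are commonly computed in monocular depth estimation. It is an illustration rather than the authors' evaluation code: the function name depth_metrics is hypothetical, the base-10 logarithm and the thresholds 1.25, 1.25^2 and 1.25^3 are assumptions taken from common practice (the abstract does not state them), and the two edge-related errors are omitted because their definitions are not given here.

```python
import math


def depth_metrics(pred, gt, thresholds=(1.25, 1.25 ** 2, 1.25 ** 3)):
    """Common depth errors over paired predicted / ground-truth depths.

    pred and gt are flat sequences of depth values; pairs with a
    non-positive value are skipped as invalid (an assumption).
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if p > 0 and g > 0]
    if not pairs:
        raise ValueError("no valid depth pairs")
    n = len(pairs)

    # Mean relative error: |d - d*| / d*, averaged over valid pixels.
    mre = sum(abs(p - g) / g for p, g in pairs) / n
    # Root mean square error in the depth unit of the inputs.
    rms = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    # Mean absolute log10 error (base 10 is assumed, not stated in the text).
    log10_err = sum(abs(math.log10(p) - math.log10(g)) for p, g in pairs) / n
    # Threshold accuracy: share of pixels with max(d/d*, d*/d) below t.
    acc = [sum(1 for p, g in pairs if max(p / g, g / p) < t) / n
           for t in thresholds]

    return {"mre": mre, "rms": rms, "log10": log10_err, "acc": acc}
```

Called on flattened depth maps, for example depth_metrics(pred_pixels, gt_pixels), the function returns a dictionary whose "acc" entry lists one accuracy per threshold, in the same order as the three accuracy values reported above.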